error when using time in rolling function pandas - python

I am trying to calculate the mean, i.e. a moving average, over every 10 seconds of data; let's say 1 to 10 sec, 11 to 20 sec, etc.
Is the code below right for this? I am getting an error when using "10sec" in the rolling function. I think it may be due to the "ltt" column, which is of type string; I am converting it to datetime, but I still get the error.
How do I resolve this error? Also, how do I do the averaging for samples collected every 10 sec? This is streaming data coming in, but for testing purposes I am using the static data in records1.
import pandas as pd
import numpy as np
records1 = [
{'ltt': 'Mon Nov 7 12:12:05 2022', 'last': 258},
{'ltt': 'Mon Nov 7 12:12:05 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:07 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:08 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:09 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:10 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:11 2022', 'last': 261},
{'ltt': 'Mon Nov 7 12:12:12 2022', 'last': 262},
{'ltt': 'Mon Nov 7 12:12:12 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:14 2022', 'last': 258},
{'ltt': 'Mon Nov 7 12:12:15 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:16 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:17 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:18 2022', 'last': 258},
{'ltt': 'Mon Nov 7 12:12:19 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:20 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:21 2022', 'last': 260},
{'ltt': 'Mon Nov 7 12:12:22 2022', 'last': 258},
{'ltt': 'Mon Nov 7 12:12:23 2022', 'last': 259},
{'ltt': 'Mon Nov 7 12:12:24 2022', 'last': 260}
]
datalist = []
def strategy1(record):
    global datalist
    datalist.append(record)
    pandas_df = pd.DataFrame(datalist)
    pandas_df['ltt'] = pd.to_datetime(pandas_df['ltt'], format="%a %b %d %H:%M:%S %Y")
    pandas_df['hour'] = pandas_df['ltt'].dt.hour
    pandas_df['minute'] = pandas_df['ltt'].dt.minute
    pandas_df['second'] = pandas_df['ltt'].dt.second
    pandas_df['max'] = pandas_df.groupby('second')['last'].transform("max")
    pandas_df["ma_1min"] = (
        pandas_df.sort_values("ltt")
        .groupby(["hour", "minute"])["last"]
        .transform(lambda x: x.rolling('10sec', min_periods=1).mean())
    )
    print(pandas_df)

I don't know exactly how to implement this in your code, but I had a similar problem where I had to group each day into 4-hour timeslots, so an approach might be something like this:
pandas_df.groupby([pandas_df['ltt'].dt.hour, pandas_df['ltt'].dt.minute, (pandas_df['ltt'].dt.second / 10).astype(int)])['last'].agg('mean')
This should basically give you 6 groups ([0s-9s -> 0], [10s-19s -> 1], etc. for the 3rd groupby index) for every minute of data.
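For what it's worth, here is a minimal sketch of how the two pieces could fit together, assuming the error comes from '10sec' not being a valid pandas offset alias (it should be '10s') and from the fact that a time-based rolling window needs a DatetimeIndex or an on= column. The ten_second_means function name is just for illustration, not from the question:
import pandas as pd

def ten_second_means(records):
    # records is a list of dicts shaped like records1 above
    df = pd.DataFrame(records)
    df['ltt'] = pd.to_datetime(df['ltt'], format="%a %b %d %H:%M:%S %Y")

    # Non-overlapping 10-second buckets (1-10s, 11-20s, ...): resample on the datetime column
    bucket_means = df.resample('10s', on='ltt')['last'].mean()

    # Time-based rolling mean: sort by time and use the datetime column as the index
    df = df.set_index('ltt').sort_index()
    df['ma_10s'] = df['last'].rolling('10s', min_periods=1).mean()
    return bucket_means, df

bucket_means, df = ten_second_means(records1)
print(bucket_means)
print(df)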

Related

Text mining on string dates from fuzzy example

I'm a bit stuck on writing a custom function that can parse dates given in any language, using a set of rules that identify and parse months and years, including some special cases (which I will show).
Input data:
dates_eg = [["Issued Feb 2018 · No Expiration Date"], ["Oct 2021 - Present"],
["Jan 2019 - Oct 2021 · 2 yrs 10 mos"], ["1994 - 2000"], ["Sep 2010 – Sep 2015 • 5 yrs"],
["Nov 2019 – Present"], ["Sep 2015 – Mar 2017 • 1 yr 6 mos"], ["Apr 2019 – Present • 2 yrs 8 mos"],
["Issued: Aug 2018 · Expired Aug 2020"], ["Mar 2017 - Dec 2017 · 10 mos"],
["May 2021 - Present · 9 mos"]]
NOTE: No translation API should be used. Instead, I've created lists of FLAG words (in various languages) and STOPWORDS that can identify all the cases.
Present, Issued, Expired, No Expiration Date -> all four of these keywords/stopwords are identified through the lists (see below).
Current workflow of the functions:
e.g. "Issued Feb 2018 · No Expiration Date"
First, the raw_dates_extractor function separates the start date ('Issued Feb 2018') from the end date ('No Expiration Date') on the separator ' · '.
Second, the clean_extract_dates function identifies 'Issued' and gets rid of it.
All of the above results in:
{'Start_date_raw0': 'Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
However, the desired output should go further by extracting/identifying years, months, and the stopwords/flag words (e.g. 'No Expiration Date' and 'Present' would be flagged as ongoing = 1, otherwise 0).
From the first raw example,
"Issued Feb 2018 · No Expiration Date" would become:
Desired result:
{'start_year': 2021, 'start_month': "Apr", 'end_year': None,
'end_month': None, 'is_on_going': True})
Complete Code attempt:
Lists with Expired, Issued, No Expiration Date, Present
import re
import pandas as pd
# "Present" word in all language combination
present_all_lang = ["současnost", "I dag", "Heute", "Present", "actualidad", "aujourd’hui", "वर्तमान"
"Saat ini", "Presente",
"現在", "현재", "Kini", "heden", "Nå", "obecnie", "o momento"
"Prezent", "настоящее время", "nu", "ปัจจุบัน ",
"Kasalukuyan", "Halen", "至今", "現在"]
# "Issued" word in all language combination
issued_all_lang = ["Vydáno", "Udstedt", "Ausgestellt", "Issued", "Expedición", "Émise le", "नव॰", "Diterbitkan"
"Data di rilascio",
"発行日", "발행일", "Dikeluarkan", "Toegekend", "Utstedt", "Wydany", "Emitido em"
"Дата выдачи", "Utfärdat", "ออกเมื่อ",
"Inisyu", "tarihinde verildi", "颁发日期", "發照日期"]
# "No experience date" in all language combination
no_experience_all_lang = ["Bez data vypršení", "Ingen udløbsdato", "Kein Ablaufdatum", "No Expiration Date",
"Sin fecha de vencimiento", "Sans date d’expiration", "समाप्ति की कोई तारीख नहीं",
"Tidak Ada Tanggal Kedaluwarsa", "Nessuna data di scadenza", "有効期限なし"
"만료일 없음",
"Tiada Tarikh Tamat Tempoh", "Geen vervaldatum" "Ingen utløpsdato",
"Brak daty ważności", "Sem data de expiração", "Без истечения срока действия"
"Inget utgångsdatum", "ไม่มีวันหมดอายุ",
"Walang Petsa ng Pagtatapos"
"Sona Erme Tarihi Yok", "长期有效", "永久有效"]
# "Expired" in all lang
expires_all_lang = ["Vyprší", "Udløbet", "Gültig bis", "Expired", "Vencimiento", "Expire le",
"को समाप्त हो गया है", "Kedaluwarsa", "Data di scadenza", "有効期限", " 만료됨", "Tamat Tempoh",
"Verloopt op", "Utløper", "Wygasa", "Expira em", "A expirat la", "Срок действия истек",
"Upphörde att gälla", "หมดอายุเมื่อ", "Mag-e-expire sa", "tarihinde süresi bitiyor", "有效期至", "過期"]
# colon my appear get rid of it!
def raw_dates_extractor(list_of_examples: list) -> dict:
    raw_start_date, raw_end_date = {}, {}
    for row, datum in enumerate(list_of_examples):
        date_splitter = re.split("[·•–-]", datum[0])
        raw_start_date[f'Start_date_raw{row}'] = date_splitter[0].strip(' ')
        raw_end_date[f'End_date_raw{row}'] = date_splitter[1].strip(' ')
    merged_start_end_dates = raw_start_date | raw_end_date
    return merged_start_end_dates
Output:
raw_dates_extractor(dates_eg)
{'Start_date_raw0': 'Issued Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Issued: Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Expired Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
def clean_extract_dates(input_dictionary: dict, aggregate_stop_words: list):
    container_replica = {}
    for date_element in input_dictionary:
        query_req = input_dictionary[f'{date_element}']
        if ":" in query_req:
            query_req = query_req.replace(":", "")
        if "." in query_req:
            query_req = query_req.replace(".", "")
        stopwords = aggregate_stop_words
        query_words = query_req.split()
        result_words = [word for word in query_words if word not in stopwords]
        end_result = ' '.join(result_words)
        container_replica[f'{date_element}'] = end_result
    return container_replica
aggregate_lists = issued_all_lang + no_experience_all_lang + expires_all_lang
saved_dictionary = clean_extract_dates(raw_dates_extractor(dates_eg), aggregate_lists)
Output:
clean_extract_dates(raw_dates_extractor(dates_eg), aggregate_lists)
{'Start_date_raw0': 'Feb 2018', 'Start_date_raw1': 'Oct 2021', 'Start_date_raw2': 'Jan 2019', 'Start_date_raw3': '1994', 'Start_date_raw4': 'Sep 2010', 'Start_date_raw5': 'Nov 2019', 'Start_date_raw6': 'Sep 2015', 'Start_date_raw7': 'Apr 2019', 'Start_date_raw8': 'Aug 2018', 'Start_date_raw9': 'Mar 2017', 'Start_date_raw10': 'May 2021', 'End_date_raw0': 'No Expiration Date', 'End_date_raw1': 'Present', 'End_date_raw2': 'Oct 2021', 'End_date_raw3': '2000', 'End_date_raw4': 'Sep 2015', 'End_date_raw5': 'Present', 'End_date_raw6': 'Mar 2017', 'End_date_raw7': 'Present', 'End_date_raw8': 'Aug 2020', 'End_date_raw9': 'Dec 2017', 'End_date_raw10': 'Present'}
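The piece that is still missing is the final step that turns the cleaned strings into the structured records from the desired result. A minimal sketch of how that step could look, reusing the lists and saved_dictionary defined above (the parse_period helper, the 3-letter Latin month regex, and the ongoing_flags list are my own illustration, not part of the question, and the regex will not handle non-Latin month names):
MONTH_YEAR_RE = re.compile(r'([A-Za-z]{3})?\s*(\d{4})')

def parse_period(start_raw: str, end_raw: str, ongoing_flags: list) -> dict:
    def split_date(raw):
        # Return (month_abbreviation_or_None, year_or_None) from a cleaned fragment
        match = MONTH_YEAR_RE.search(raw)
        if not match:
            return None, None
        return match.group(1), int(match.group(2))

    is_on_going = end_raw in ongoing_flags
    start_month, start_year = split_date(start_raw)
    end_month, end_year = (None, None) if is_on_going else split_date(end_raw)
    return {'start_year': start_year, 'start_month': start_month,
            'end_year': end_year, 'end_month': end_month, 'is_on_going': is_on_going}

ongoing_flags = present_all_lang + no_experience_all_lang
print(parse_period(saved_dictionary['Start_date_raw0'],
                   saved_dictionary['End_date_raw0'],
                   ongoing_flags))
# {'start_year': 2018, 'start_month': 'Feb', 'end_year': None, 'end_month': None, 'is_on_going': True}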

Sort dictionary by dict value

I have such dict:
resp = {'1366451044687880192': {'created_at': 'Mon Mar 01 18:11:31 +0000 2021', 'id': 1366463640451233323}, '1366463640451256323': {'created_at': 'Mon Mar 05 19:01:34 +0000 2021', 'id': 1366463640451256323}}
Is it possible to sort it by the created_at value?
I tried this, but it doesn't work:
sorted(resp.values(), key=lambda item: item[0]['created_at'])
Try it online!
resp = {
'1366451044687880192': {
'created_at': 'Mon Mar 01 18:11:31 +0000 2021',
'id': 1366463640451233323
},
'1366463640451256323': {
'created_at': 'Mon Mar 05 19:01:34 +0000 2021',
'id': 1366463640451256323
}
}
print(sorted(resp.values(), key=lambda item: item['created_at']))
Output:
[
{'created_at': 'Mon Mar 01 18:11:31 +0000 2021', 'id': 1366463640451233323},
{'created_at': 'Mon Mar 05 19:01:34 +0000 2021', 'id': 1366463640451256323}
]
Or you can sort key-values (items) through (Try it online!):
sorted(resp.items(), key = lambda item: item[1]['created_at'])
which outputs:
[
('1366451044687880192', {'created_at': 'Mon Mar 01 18:11:31 +0000 2021', 'id': 1366463640451233323}),
('1366463640451256323', {'created_at': 'Mon Mar 05 19:01:34 +0000 2021', 'id': 1366463640451256323})
]
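One caveat: sorting on the raw created_at string only happens to give chronological order here because both timestamps share the same weekday, month, and year prefix; lexicographic order of such strings is not chronological order in general, since the weekday and month names sort alphabetically rather than by date. A more robust sketch, assuming the Twitter-style timestamp format shown above, parses the string first:
from datetime import datetime

fmt = '%a %b %d %H:%M:%S %z %Y'  # e.g. 'Mon Mar 01 18:11:31 +0000 2021'
print(sorted(resp.values(), key=lambda item: datetime.strptime(item['created_at'], fmt)))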

Removing substring from list of strings

I have a column of values as below,
array(['Mar 2018', 'Jun 2018', 'Sep 2018', 'Dec 2018', 'Mar 2019',
'Jun 2019', 'Sep 2019', 'Dec 2019', 'Mar 2020', 'Jun 2020',
'Sep 2020', 'Dec 2020'], dtype=object)
From these values I require the output as,
array(['Mar'18', 'Jun'18', 'Sep'18', 'Dec'18', 'Mar'19',
'Jun'19', 'Sep'19', 'Dec'19', 'Mar'20', 'Jun'20',
'Sep'20', 'Dec'20'], dtype=object)
I have tried the following code,
df['Period'] = df['Period'].replace({'20','''})
but it isn't converting them; how can I do the replacement?
Any help?
Thanks
With your shown samples, please try following.
df['Period'].replace(r" \d{2}", "'", regex=True)
Output will be as follows.
0 Mar'18
1 Jun'18
2 Sep'18
3 Dec'18
4 Mar'19
5 Jun'19
6 Sep'19
7 Dec'19
8 Mar'20
9 Jun'20
10 Sep'20
11 Dec'20
Try this regex:
df['Period'].str.replace(r"\s\d{2}(\d{2})", r"'\1", regex=True)
In the replacement part, \1 refers to the capturing group, which contains the last two digits in this case.
Following your code (slightly changed to work) will not get you what you need as it will replace all '20's.
>>> df['Period'] = df['Period'].str.replace('20','')
Out[179]:
Period
0 Mar 18
1 Jun 18
2 Sep 18
3 Dec 18
4 Mar 19
5 Jun 19
6 Sep 19
7 Dec 19
8 Mar
9 Jun
10 Sep
11 Dec
Another way, without using regex, would be with vectorized str methods (more here):
df['Period_refined'] = df['Period'].str[:3] + "'" + df['Period'].str[-2:]
Output
df
Period Period_refined
0 Mar 2018 Mar'18
1 Jun 2018 Jun'18
2 Sep 2018 Sep'18
3 Dec 2018 Dec'18
4 Mar 2019 Mar'19
5 Jun 2019 Jun'19
6 Sep 2019 Sep'19
7 Dec 2019 Dec'19
8 Mar 2020 Mar'20
9 Jun 2020 Jun'20
10 Sep 2020 Sep'20
11 Dec 2020 Dec'20
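If the Period column should remain tied to real dates (for sorting or plotting), another option is to round-trip through datetime and format it back. This is just a sketch and assumes every value matches the 'Mon YYYY' pattern shown in the samples:
df['Period_refined'] = pd.to_datetime(df['Period'], format='%b %Y').dt.strftime("%b'%y")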

split element inside list python [duplicate]

This question already has answers here:
Is there a way to split a string by every nth separator in Python?
(6 answers)
Closed 4 years ago.
I was trying to split the element inside the list based on a certain length. Here is the list: ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14']. Could anyone help me retrieve the values from the list in the following format:
['Mar 18','Mar 17','Mar 16','Mar 15','Mar 14']
A regular expression based approach that would handle cases like Apr 1 or Dec 31 as well as multiple elements in the initial list:
import re
lst = ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14']
[x for y in lst for x in re.findall(r'[A-Z][a-z]+ \d{1,2}', y)]
# ['Mar 18', 'Mar 17', 'Mar 16', 'Mar 15', 'Mar 14']
lst = ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14']
span = 2
s = lst[0].split(" ")
s = [" ".join(s[i:i+span]) for i in range(0, len(s), span)]
print(s)
For me, this prints
['Mar 18','Mar 17','Mar 16','Mar 15','Mar 14']
Taken from this answer.
Try this code!
You can do it with a regular expression (just import the re library in Python):
import re
lst = ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14 Mar 2']
obj = re.findall(r'[A-Z][a-z]+[ ](?:\d{2}|\d{1})', lst[0])
print(obj)
Output :
['Mar 18', 'Mar 17', 'Mar 16', 'Mar 15', 'Mar 14', 'Mar 2']
You can also try this one:
>>> lst = ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14']
>>> result = lst[0].split(" ")
>>> [i+' '+j for i,j in zip(result[::2], result[1::2])]
Output
['Mar 18', 'Mar 17', 'Mar 16', 'Mar 15', 'Mar 14']

Concatenate ListA elements with partially matching ListB elements

Say I have two python lists as:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
I need to get an output as:
List_op = ['Jan 2018 Sales Jan 2018 Units sold Jan 2018','Feb 2018 Sales Feb 2018 Units sold Feb 2018','Mar 2018 Sales Mar 2018 Units sold Mar 2018']
My approach so far:
res = set()
for i in ListB:
    for j in ListA:
        if j in i:
            res.add(f'{i} {j}')
print(res)
This gives me the result:
{'Units sold Jan 2018 Jan 2018', 'Sales Feb 2018 Feb 2018', 'Units sold Mar 2018 Mar 2018', 'Units sold Feb 2018 Feb 2018', 'Sales Jan 2018 Jan 2018', 'Sales Mar 2018 Mar 2018'}
which is definitely not the solution I'm looking for.
I think a regular expression could be helpful here, but I'm not sure how to approach it. Any help in this regard is highly appreciated.
Thanks in advance.
Edit:
Values in ListA and ListB are not necessarily in order. Therefore, for a particular month/year value in ListA, the same month/year value from ListB has to be matched and picked for both the 'Sales' and 'Units sold' components and then concatenated.
My main goal here is to get the list which I can use later to generate a statement that I'll be using to write Hive query.
Added more explanation as suggested by #andrew_reece
Assuming no additional edge cases that need taking care of, your original code is not bad, just needs a slight update:
List_op = []
for a in ListA:
    combined = a
    for b in ListB:
        if a in b:
            combined += " " + b
    List_op.append(combined)
List_op
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Supposing ListA and ListB are sorted:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
print([v1 + " " + v2 for v1, v2 in zip(ListA, [v1 + " " + v2 for v1, v2 in zip(ListB[::2], ListB[1::2])])])
This will print:
['Jan 2018 Sales Jan 2018 Units sold Jan 2018', 'Feb 2018 Sales Feb 2018 Units sold Feb 2018', 'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
In my example I first concatenate the ListB items pairwise and then join ListA with this new list.
String concatenation can become expensive. In Python 3.6+, you can use more efficient f-strings within a list comprehension:
res = [f'{i} {j} {k}' for i, j, k in zip(ListA, ListB[::2], ListB[1::2])]
print(res)
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Using itertools.islice, you can avoid the expense of creating new lists:
from itertools import islice
zipper = zip(ListA, islice(ListB, 0, None, 2), islice(ListB, 1, None, 2))
res = [f'{i} {j} {k}' for i, j, k in zipper]
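Since the Edit says the values in ListA and ListB are not necessarily in order, the zip-based one-liners above only hold when ListB comes pre-paired and sorted. A sketch that groups ListB entries under whichever ListA key they contain, regardless of input order (the concat_by_period name is mine; within each group the pieces keep ListB's order):
from collections import defaultdict

def concat_by_period(list_a, list_b):
    # Group each ListB entry under the first ListA period found inside it
    grouped = defaultdict(list)
    for b in list_b:
        for a in list_a:
            if a in b:
                grouped[a].append(b)
                break
    return [' '.join([a] + grouped[a]) for a in list_a]

print(concat_by_period(ListA, ListB))
# ['Jan 2018 Sales Jan 2018 Units sold Jan 2018', 'Feb 2018 Sales Feb 2018 Units sold Feb 2018', 'Mar 2018 Sales Mar 2018 Units sold Mar 2018']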
