Removing substring from list of strings - python

I have a column of values as below:
array(['Mar 2018', 'Jun 2018', 'Sep 2018', 'Dec 2018', 'Mar 2019',
'Jun 2019', 'Sep 2019', 'Dec 2019', 'Mar 2020', 'Jun 2020',
'Sep 2020', 'Dec 2020'], dtype=object)
From these values I require the output as:
array(['Mar'18', 'Jun'18', 'Sep'18', 'Dec'18', 'Mar'19',
'Jun'19', 'Sep'19', 'Dec'19', 'Mar'20', 'Jun'20',
'Sep'20', 'Dec'20'], dtype=object)
I have tried with following code,
df['Period'] = df['Period'].replace({'20','''})
But this wasn't converting the values. How can I replace them?
Any help?
Thanks

With your shown samples, please try the following.
df['Period'].replace(r" \d{2}", "'", regex=True)
Output will be as follows.
0 Mar'18
1 Jun'18
2 Sep'18
3 Dec'18
4 Mar'19
5 Jun'19
6 Sep'19
7 Dec'19
8 Mar'20
9 Jun'20
10 Sep'20
11 Dec'20

Try this regex:
df['Period'].str.replace(r"\s\d{2}(\d{2})", r"'\1", regex=True)
In the replacement part, \1 refers to the capturing group, which is the last two digits in this case.
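The same pattern can be sanity-checked with the standard re module, which pandas applies element-wise under the hood (a minimal sketch on the sample values):

```python
import re

vals = ['Mar 2018', 'Jun 2019', 'Dec 2020']
# \s\d{2}(\d{2}) matches the space plus the four-digit year,
# capturing only the last two digits for reuse as \1
out = [re.sub(r"\s\d{2}(\d{2})", r"'\1", v) for v in vals]
print(out)  # ["Mar'18", "Jun'19", "Dec'20"]
```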

Your code (slightly changed so it works) will not get you what you need, as it will replace all occurrences of '20':
>>> df['Period'] = df['Period'].str.replace('20','')
Out[179]:
Period
0 Mar 18
1 Jun 18
2 Sep 18
3 Dec 18
4 Mar 19
5 Jun 19
6 Sep 19
7 Dec 19
8 Mar
9 Jun
10 Sep
11 Dec
Another way, without using regex, would be with vectorized str methods:
df['Period_refined'] = df['Period'].str[:3] + "'" + df['Period'].str[-2:]
Output
df
Period Period_refined
0 Mar 2018 Mar'18
1 Jun 2018 Jun'18
2 Sep 2018 Sep'18
3 Dec 2018 Dec'18
4 Mar 2019 Mar'19
5 Jun 2019 Jun'19
6 Sep 2019 Sep'19
7 Dec 2019 Dec'19
8 Mar 2020 Mar'20
9 Jun 2020 Jun'20
10 Sep 2020 Sep'20
11 Dec 2020 Dec'20
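Since every value has a three-letter month and a trailing two-digit year, the same slice logic works on plain Python strings too (a quick sketch, independent of pandas):

```python
vals = ['Mar 2018', 'Jun 2019', 'Dec 2020']
# first three characters, an apostrophe, then the last two characters
refined = [v[:3] + "'" + v[-2:] for v in vals]
print(refined)  # ["Mar'18", "Jun'19", "Dec'20"]
```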


error when using time in rolling function pandas

I am trying to calculate a mean, i.e. a moving average, over every 10 seconds of data; let's say 1 to 10 sec, 11 to 20 sec, etc.
Is the code below right for this? I am getting an error when using "60sec" in the rolling function. I think it may be due to the "ltt" column, which is of type string; I am converting it to datetime, but the error still occurs.
How do I resolve this error? Also, how do I average samples collected every 10 sec? This is streaming data coming in, but for testing purposes I am using the static data in records1.
import pandas as pd
import numpy as np

records1 = [
    {'ltt': 'Mon Nov 7 12:12:05 2022', 'last': 258},
    {'ltt': 'Mon Nov 7 12:12:05 2022', 'last': 259},
    {'ltt': 'Mon Nov 7 12:12:07 2022', 'last': 259},
    {'ltt': 'Mon Nov 7 12:12:08 2022', 'last': 260},
    {'ltt': 'Mon Nov 7 12:12:09 2022', 'last': 259},
    {'ltt': 'Mon Nov 7 12:12:10 2022', 'last': 259},
    {'ltt': 'Mon Nov 7 12:12:11 2022', 'last': 261},
    {'ltt': 'Mon Nov 7 12:12:12 2022', 'last': 262},
    {'ltt': 'Mon Nov 7 12:12:12 2022', 'last': 260},
    {'ltt': 'Mon Nov 7 12:12:14 2022', 'last': 258},
    {'ltt': 'Mon Nov 7 12:12:15 2022', 'last': 259},
    {'ltt': 'Mon Nov 7 12:12:16 2022', 'last': 259},
    {'ltt': 'Mon Nov 7 12:12:17 2022', 'last': 260},
    {'ltt': 'Mon Nov 7 12:12:18 2022', 'last': 258},
    {'ltt': 'Mon Nov 7 12:12:19 2022', 'last': 259},
    {'ltt': 'Mon Nov 7 12:12:20 2022', 'last': 260},
    {'ltt': 'Mon Nov 7 12:12:21 2022', 'last': 260},
    {'ltt': 'Mon Nov 7 12:12:22 2022', 'last': 258},
    {'ltt': 'Mon Nov 7 12:12:23 2022', 'last': 259},
    {'ltt': 'Mon Nov 7 12:12:24 2022', 'last': 260}
]

datalist = []

def strategy1(record):
    global datalist
    datalist.append(record)
    pandas_df = pd.DataFrame(datalist)
    pandas_df['ltt'] = pd.to_datetime(pandas_df['ltt'], format="%a %b %d %H:%M:%S %Y")
    pandas_df['hour'] = pandas_df['ltt'].dt.hour
    pandas_df['minute'] = pandas_df['ltt'].dt.minute
    pandas_df['second'] = pandas_df['ltt'].dt.second
    pandas_df['max'] = pandas_df.groupby('second')['last'].transform("max")
    pandas_df["ma_1min"] = (
        pandas_df.sort_values("ltt")
        .groupby(["hour", "minute"])["last"]
        .transform(lambda x: x.rolling('10sec', min_periods=1).mean())
    )
    print(pandas_df)
I don't know exactly how to implement this in your code, but I had a similar problem where I had to group each day into 4-hour time slots. An approach might be something like this:
pandas_df.groupby([pandas_df['ltt'].dt.hour, pandas_df['ltt'].dt.minute, (pandas_df['ltt'].dt.second / 10).astype(int)])['last'].agg('mean')
This should basically give you 6 groups per minute of data ([0s-9s -> 0], [10s-19s -> 1], etc. for the 3rd groupby index). Note that the column has to be selected with ['last'] rather than attribute access, since .last clashes with the GroupBy.last method.
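For fixed 10-second buckets, resample is usually simpler than a manual groupby. Also, the rolling error in the question comes from the offset string: pandas accepts '10s', not '10sec', and a time-based window needs a monotonic datetime column passed via on=. A minimal sketch with a few of the sample records:

```python
import pandas as pd

df = pd.DataFrame({
    'ltt': ['Mon Nov 7 12:12:05 2022', 'Mon Nov 7 12:12:09 2022',
            'Mon Nov 7 12:12:14 2022', 'Mon Nov 7 12:12:15 2022'],
    'last': [258, 260, 261, 263],
})
df['ltt'] = pd.to_datetime(df['ltt'], format="%a %b %d %H:%M:%S %Y")

# Fixed bins (0-9s, 10-19s, ...): resample on the datetime column
bucket_means = df.resample('10s', on='ltt')['last'].mean()
print(bucket_means)  # 12:12:00 -> 259.0, 12:12:10 -> 262.0

# Trailing 10-second moving average over a sorted datetime column
ma = df.sort_values('ltt').rolling('10s', on='ltt')['last'].mean()
```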

Key error while appending the details to dictionary

I have a test log like the one below and am trying to read it in a better way. I got a KeyError while adding elements to the dictionary: when checking the if condition no output is generated, and in the elif branch I get a KeyError.
Jan 23 2016 10:30:08AM - bla bla Server-1A linked
Jan 23 2016 11:04:56AM - bla bla Server-1B linked
Jan 23 2016 1:18:32PM - bla bla Server-1B dislinked from server
Jan 23 2016 4:16:09PM - bla bla DOS activity from 201.10.0.4
Jan 23 2016 9:43:44PM - bla bla Server-1A dislinked from server
Feb 1 2016 12:40:28AM - bla bla Server-1A linked
Feb 1 2016 1:21:52AM - bla bla DOS activity from 192.168.123.4
Mar 29 2016 1:13:07PM - bla bla Server-1A dislinked from server
Code
result = []
_dict = {}
spu = []
with open(r'C:\Users\Desktop\test.log') as f:
    for line in f:
        date, rest = line.split(' - ', 1)
        conn_disconn = rest.split(' ')[3]
        server_name = rest.split(' ')[2]
        if line.strip()[-1].isdigit():
            dos = re.findall('[0-9]+(?:\.[0-9]+){3}', line)
            spu.extend(dos)
        ## Error part is below
        if conn_disconn == 'linked':
            dict_to_append = {server_name: [(conn_disconn, date)]}
            print(dict_to_append)
            _dict[server_name] = dict_to_append
            result.append(dict_to_append)
        elif conn_disconn == 'dislinked':
            _dict[server_name][server_name].append(conn_disconn, date)
            del _dict[server_name]
print(result)
Expected output
[{'Server-1A': [('linked', 'Jan 23 2016 11:30:08AM'), ('dislinked', 'Jan 23 2016 10:43:44PM')]},
{'Server-1B': [('linked', 'Jan 23 2016 12:04:56AM'), ('dislinked', 'Jan 23 2016 2:18:32PM')]},
{'Server-1A': [('linked', 'Feb 1 2016 1:40:28AM'), ('dislinked', 'Mar 29 2016 2:13:07PM')]},
{Dos:['201.10.0.4','192.168.123.4']}]
When you check if conn_disconn == 'linked':, conn_disconn contains 'linked\n' (with a trailing newline), so the comparison fails; nothing is added to the dictionary, and you get the KeyError later in the elif branch.
import re

result = []
_dict = {}
spu = []
with open(r'C:\Users\Desktop\test.log') as f:
    for line in f:
        date, rest = line.split(' - ', 1)
        conn_disconn = rest.split(' ')[3].strip()
        server_name = rest.split(' ')[2]
        if line.strip()[-1].isdigit():
            dos = re.findall(r'[0-9]+(?:\.[0-9]+){3}', line)
            spu.extend(dos)
        if conn_disconn == 'linked':
            dict_to_append = {server_name: [(conn_disconn, date)]}
            print(dict_to_append)
            _dict[server_name] = dict_to_append[server_name]
            result.append(dict_to_append)
        elif conn_disconn == 'dislinked':
            _dict[server_name].append((conn_disconn, date))
            del _dict[server_name]
print(result)
Output:
[{'Server-1A': [('linked', 'Jan 23 2016 10:30:08AM'), ('dislinked', 'Jan 23 2016 9:43:44PM')]}, {'Server-1B': [('linked', 'Jan 23 2016 11:04:56AM'), ('dislinked', 'Jan 23 2016 1:18:32PM')]}, {'Server-1A': [('linked', 'Feb 1 2016 12:40:28AM'), ('dislinked', 'Mar 29 2016 1:13:07PM')]}]
append takes one argument, but you have given it two in some cases. Look at the append arguments in this line of your code:
_dict[server_name][server_name].append(conn_disconn,date)
Instead, you need to add parentheses in order to pass a tuple, like this:
_dict[server_name][server_name].append((conn_disconn,date))
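Putting both fixes together (stripping the trailing newline and appending a tuple), the corrected flow can be sketched end-to-end, with io.StringIO standing in for the log file:

```python
import io
import re

log = io.StringIO(
    "Jan 23 2016 10:30:08AM - bla bla Server-1A linked\n"
    "Jan 23 2016 1:18:32PM - bla bla Server-1A dislinked from server\n"
    "Feb 1 2016 1:21:52AM - bla bla DOS activity from 192.168.123.4\n"
)

result = []      # completed linked/dislinked pairs, in order of appearance
open_links = {}  # server name -> event list still awaiting a 'dislinked'
spu = []         # DOS source addresses

for line in log:
    date, rest = line.strip().split(' - ', 1)
    spu.extend(re.findall(r'[0-9]+(?:\.[0-9]+){3}', rest))
    words = rest.split()
    if 'linked' in words:  # exact word match, so 'dislinked' does not hit this
        server = words[2]
        entry = {server: [('linked', date)]}
        open_links[server] = entry[server]
        result.append(entry)
    elif 'dislinked' in words:
        server = words[2]
        open_links.pop(server).append(('dislinked', date))

print(result)
```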
Try this:
data = []
dff.seek(0)
for line in dff:
    try:
        date = re.search(r'\b^.*PM|\b^.*AM', line).group()
        server = re.search(r'\b(?:Server-\d[A-Z]|Server-1B)\b', line).group()
        linked = re.search(r'\b(?:linked|dislinked)\b', line).group().split()[0]
    except:
        continue
    data.append({server: [(linked, date)]})
data
Out[2374]:
#[{'Server-1A': [('linked', 'Jan 23 2016 10:30:08AM')]},
# {'Server-1B': [('linked', 'Jan 23 2016 11:04:56AM')]},
# {'Server-1B': [('dislinked', 'Jan 23 2016 1:18:32PM')]},
# {'Server-1A': [('dislinked', 'Jan 23 2016 9:43:44PM')]},
# {'Server-1A': [('linked', 'Feb 1 2016 12:40:28AM')]},
# {'Server-1A': [('dislinked', 'Mar 29 2016 1:13:07PM')]}#]

Extracting multiple date formats through regex in python

I am trying to extract dates from text in Python. These are the possible texts and date patterns:
"Auction details: 14 December 2016, Pukekohe Park"
"Auction details: 17 Feb 2017, Gold Sacs Road"
"Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)"
"Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)"
"Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)"
"Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)"
"Auction details: Thursday, 28th February '19"
"Auction details: Friday, 1st February '19"
This is what I have written so far,
mon = ' (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?) '
day1 = r'\d{1,2}'
day_test = r'\d{1,2}(?:th)|\d{1,2}(?:st)'
year1 = r'\d{4}'
year2 = r'\(\d{4}\)'
dummy = r'.*'
This captures cases 1,2.
match = re.search(day1 + mon + year1, "Auction details: 14 December 2016, Pukekohe Park")
print match.group()
This somewhat captures cases 3, 4 and 5, but it prints everything from the text: in the case below I want 25 Nov 2016, yet the regex pattern gives me 25 Nov 3:00 p.m. (On Site)(2016).
So Question 1: How do I get only the date here?
match = re.search(day1 + mon + dummy + year2, "Friday 25 Nov 3:00 p.m. (On Site)(2016)")
print match.group()
Question 2: Similarly, how do I capture cases 6, 7 and 8? What should the regex be for that?
If not, is there a better way to capture dates from these formats?
You may try
((?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s+\d{1,2}(?:st|nd|rd|th)?|\d{1,2}(?:st|nd|rd|th)?\s+(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)))(?:.*(\b\d{2}(?:\d{2})?\b))?
See the regex demo.
Note I made all groups in the regex blocks non-capturing ((Nov|Dec) -> (?:Nov|Dec)), added an optional (?:st|nd|rd|th)? group after the day digits, changed the year pattern to \b\d{2}(?:\d{2})?\b so that it only matches 4- or 2-digit chunks as whole words, and created an alternation group to account for dates where the day comes before the month and vice versa.
The day and month are captured into Group 1 and the year is captured into Group 2, so the result is the concatenation of both.
NOTE: In case you need to match years in a safer way, you may want to make the year pattern more precise. E.g., if you want to avoid matching the 4- or 2-digit whole words right after :, add a negative lookbehind:
year1 = r'\b(?<!:)\d{2}(?:\d{2})?\b'
^^^^^^
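The effect of the lookbehind can be checked directly with re.findall; a quick sketch on one of the sample strings:

```python
import re

year1 = r'\b(?<!:)\d{2}(?:\d{2})?\b'
# '00' in '1:00' sits right after ':' and is skipped; '27' and '2016' match
print(re.findall(year1, "Wed 27 Apr 1:00 p.m. (2016)"))  # ['27', '2016']
```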
Also, you may add word boundaries around the whole pattern to ensure a whole word match.
Here is the Python demo:
import re
mon = r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)'
day1 = r'\d{1,2}(?:st|nd|rd|th)?'
year1 = r'\b\d{2}(?:\d{2})?\b'
dummy = r'.*'
rx = r"((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
# Or, try this if a partial number before a date is parsed as day:
# rx = r"\b((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
strs = ["Auction details: 14 December 2016, Pukekohe Park","Auction details: 17 Feb 2017, Gold Sacs Road","Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)","Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)","Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)","Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)","Auction details: Thursday, 28th February '19","Auction details: Friday, 1st February '19","Friday 25 Nov 3:00 p.m. (On Site)(2016)"]
for s in strs:
    print(s)
    m = re.search(rx, s)
    if m:
        print("{} {}".format(m.group(1), m.group(2)))
    else:
        print("NO MATCH")
Output:
Auction details: 14 December 2016, Pukekohe Park
14 December 2016
Auction details: 17 Feb 2017, Gold Sacs Road
17 Feb 2017
Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)
27 Apr 2016
Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)
27 Apr 2016
Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)
27 Apr 2016
Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)
November 16 2016
Auction details: Thursday, 28th February '19
28th February 19
Auction details: Friday, 1st February '19
1st February 19
Friday 25 Nov 3:00 p.m. (On Site)(2016)
25 Nov 2016

split element inside list python [duplicate]

This question already has answers here:
Is there a way to split a string by every nth separator in Python?
(6 answers)
Closed 4 years ago.
I was trying to split the element inside the list based on a certain length. Here is the list: ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14']. Could anyone help me retrieve the values from the list in the following format:
['Mar 18','Mar 17','Mar 16','Mar 15','Mar 14']
A regular expression based approach that would handle cases like Apr 1 or Dec 31 as well as multiple elements in the initial list:
import re
lst = ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14']
[x for y in lst for x in re.findall(r'[A-Z][a-z]+ \d{1,2}', y)]
# ['Mar 18', 'Mar 17', 'Mar 16', 'Mar 15', 'Mar 14']
lst = ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14']
span = 2
words = lst[0].split(" ")
s = [" ".join(words[i:i+span]) for i in range(0, len(words), span)]
print(s)
For me, this prints
['Mar 18','Mar 17','Mar 16','Mar 15','Mar 14']
Taken from this answer.
Try this code!
You can do it with a regular expression (just import the re library in Python):
import re
lst = ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14 Mar 2']
obj = re.findall(r'[A-Z][a-z]+[ ](?:\d{2}|\d{1})', lst[0])
print(obj)
Output :
['Mar 18', 'Mar 17', 'Mar 16', 'Mar 15', 'Mar 14', 'Mar 2']
You can also try this one
>>> lst = ['Mar 18 Mar 17 Mar 16 Mar 15 Mar 14']
>>> result = lst[0].split(" ")
>>> [i+' '+j for i,j in zip(result[::2], result[1::2])]
Output
['Mar 18', 'Mar 17', 'Mar 16', 'Mar 15', 'Mar 14']

Concatenate ListA elements with partially matching ListB elements

Say I have two Python lists:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
I need to get an output as:
List_op = ['Jan 2018 Sales Jan 2018 Units sold Jan 2018','Feb 2018 Sales Feb 2018 Units sold Feb 2018','Mar 2018 Sales Mar 2018 Units sold Mar 2018']
My approach so far:
res=set()
for i in ListB:
for j in ListA:
if j in i:
res.add(f'{i} {j}')
print (res)
this gives me result as:
{'Units sold Jan 2018 Jan 2018', 'Sales Feb 2018 Feb 2018', 'Units sold Mar 2018 Mar 2018', 'Units sold Feb 2018 Feb 2018', 'Sales Jan 2018 Jan 2018', 'Sales Mar 2018 Mar 2018'}
which is definitely not the solution I'm looking for.
I think a regular expression could be helpful here, but I'm not sure how to approach it. Any help in this regard is highly appreciated.
Thanks in advance.
Edit:
Values in ListA and ListB are not necessarily in order. Therefore, for a particular month/year value in ListA, the same month/year value from ListB has to be matched and picked for both the 'Sales' and 'Units sold' components, and then concatenated.
My main goal here is to get the list which I can use later to generate a statement that I'll be using to write Hive query.
Added more explanation as suggested by #andrew_reece
Assuming no additional edge cases that need taking care of, your original code is not bad; it just needs a slight update:
List_op = []
for a in ListA:
    combined = a
    for b in ListB:
        if a in b:
            combined += " " + b
    List_op.append(combined)

List_op
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Supposing ListA and ListB are sorted:
ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Sales Jan 2018','Units sold Jan 2018','Sales Feb 2018','Units sold Feb 2018','Sales Mar 2018','Units sold Mar 2018']
print([v1 + " " + v2 for v1, v2 in zip(ListA, [v1 + " " + v2 for v1, v2 in zip(ListB[::2], ListB[1::2])])])
This will print:
['Jan 2018 Sales Jan 2018 Units sold Jan 2018', 'Feb 2018 Sales Feb 2018 Units sold Feb 2018', 'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
In my example I first concatenate the ListB values together pairwise and then join ListA with this new list.
String concatenation can become expensive. In Python 3.6+, you can use more efficient f-strings within a list comprehension:
res = [f'{i} {j} {k}' for i, j, k in zip(ListA, ListB[::2], ListB[1::2])]
print(res)
['Jan 2018 Sales Jan 2018 Units sold Jan 2018',
'Feb 2018 Sales Feb 2018 Units sold Feb 2018',
'Mar 2018 Sales Mar 2018 Units sold Mar 2018']
Using itertools.islice, you can avoid the expense of creating new lists:
from itertools import islice
zipper = zip(ListA, islice(ListB, 0, None, 2), islice(ListB, 1, None, 2))
res = [f'{i} {j} {k}' for i, j, k in zipper]
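If the lists are not sorted (as the edit notes), the zip-based solutions above break down; one pass can group ListB by its trailing month-year key before joining, avoiding the nested-loop scan. A sketch, assuming every ListB entry ends with a 'Mon YYYY' suffix:

```python
import re
from collections import defaultdict

ListA = ['Jan 2018', 'Feb 2018', 'Mar 2018']
ListB = ['Units sold Feb 2018', 'Sales Jan 2018', 'Sales Mar 2018',
         'Units sold Jan 2018', 'Sales Feb 2018', 'Units sold Mar 2018']

# Bucket each ListB entry under its trailing "Mon YYYY" key
groups = defaultdict(list)
for b in ListB:
    key = re.search(r'[A-Z][a-z]{2} \d{4}$', b).group()
    groups[key].append(b)

# Join each ListA value with everything that shares its key
List_op = [' '.join([a] + groups[a]) for a in ListA]
print(List_op[0])  # Jan 2018 Sales Jan 2018 Units sold Jan 2018
```

Entries within a group keep their ListB order, so the 'Sales'/'Units sold' order follows the input rather than being forced.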
