Split period into two dates when the date has the same delimiter - python

Goal: derive period start and period end from the column period, in the form of
dd.mm.yyyy - dd.mm.yyyy
period
28-02-2022 - 30.09.2022
31.01.2022 - 31.12.2022
28.02.2019 - 30-04-2020
20.01.2019-22.02.2020
19.03.2020- 24.05.2021
13.09.2022-12-10.2022
df[['period_start', 'period_end']] = df['period'].str.split('-', expand=True)
will not work, because some of the dates also use "-" as their own separator.
Expected output
period_start period_end
31.02.2022 30.09.2022
31.01.2022 31.12.2022
28.02.2019 30.04.2020
20.01.2019 22.02.2020
19.03.2020 24.05.2021
13.09.2022 12.10.2022

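One option is str.extract with an explicit date pattern, followed by replacing the dashes inside each date with dots:
date = r'\d{2}[.-]\d{2}[.-]\d{4}'
parts = df["period"].str.extract(rf'({date})\s*-\s*({date})')
df[["period_start", "period_end"]] = parts.apply(lambda s: s.str.replace('-', '.', regex=False))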

The problem is that you were splitting on every dash, and some rows contain several dashes. This works for the rows where the separator is surrounded by spaces:
df[['period_start','period_end']]= df['period'].str.split(' - ',expand=True)
because it splits on space + dash + space.

Use a regex to split on the dash with surrounding spaces:
out = (df['period'].str.split(r'\s+-\s+',expand=True)
.set_axis(['period_start', 'period_end'], axis=1)
)
or to remove the column and create new ones:
df[['period_start', 'period_end']] = df.pop('period').str.split(r'\s+-\s+',expand=True)
output:
period_start period_end
0 31-02-2022 30.09.2022
1 31.01.2022 31.12.2022
2 28.02.2019 30-04-2020
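The split above relies on spaces around the separating dash, which rows such as 20.01.2019-22.02.2020 do not have. A minimal sketch of one way around that, assuming every start date ends in a four-digit year: split on the first dash that directly follows four digits, then normalise the dashes inside each date to dots. The lookbehind pattern is an assumption of this sketch, not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({"period": [
    "28-02-2022 - 30.09.2022",
    "31.01.2022 - 31.12.2022",
    "28.02.2019 - 30-04-2020",
    "20.01.2019-22.02.2020",
    "19.03.2020- 24.05.2021",
    "13.09.2022-12-10.2022",
]})

# Split once, on the dash that comes right after a four-digit year
# (optionally surrounded by whitespace).
parts = df["period"].str.split(r"(?<=\d{4})\s*-\s*", n=1, expand=True, regex=True)

# Dashes left inside a date are date separators: turn them into dots.
parts = parts.apply(lambda col: col.str.replace("-", ".", regex=False))

df["period_start"] = parts[0]
df["period_end"] = parts[1]
print(df[["period_start", "period_end"]])
```

The explicit regex=True argument needs pandas 1.4 or later.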

Related

how to convert in python negative value objects in dataframe to float

I'd like to create a function which converts objects to float. I tried to find a solution, but I always get errors:
# sample dataframe
d = {'price':['−$13.79', '−$ 13234.79', '$ 132834.79', 'R$ 75.900,00', 'R$ 69.375,12', '- $ 2344.92']}
df = pd.DataFrame(data=d)
I tried this code first, just to find a working approach.
df['price'] = (df.price.str.replace("−$", "-").str.replace(r'\w+\$\s+', '').str.replace('.', '')\
.str.replace(',', '').astype(float)) / 100
So the idea was to convert −$ to - (for negative values), then $ to ''.
But as a result I get:
ValueError: could not convert string to float: '−$1379'
You can extract the numbers on one side, and identify whether there is a minus in the other side, then combine:
import numpy as np
factor = np.where(df['price'].str.match(r'[−-]'), -1, 1)/100
out = (pd.to_numeric(df['price'].str.replace(r'\D', '', regex=True), errors='coerce')
.mul(factor)
)
output:
0 -13.79
1 -13234.79
2 132834.79
3 75900.00
4 69375.12
5 -2344.92
Name: price, dtype: float64
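You can also use re, applied element by element (re.sub works on a single string, not on a whole Series):
import re
df['price'] = df['price'].apply(lambda s: float(re.sub(r'[^0-9-]', '', s.replace('−', '-'))) / 100)
This first turns the Unicode minus sign (U+2212) into an ASCII hyphen, then removes every character that is not a digit or "-".
Dividing by 100 restores the two decimal places that are lost when the "." and "," separators are removed.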
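cleaned = df["price"].str.replace("−", "-", regex=False).str.replace(r"[R$\s.,]", "", regex=True)
df["price2"] = pd.to_numeric(cleaned) / 100
df["price3"] = cleaned.astype(float) / 100
df
A few notes:
In a regex, an unescaped dot matches any character, so a literal dot has to be escaped or placed inside a character class.
The "−" symbol you are using is not the ASCII hyphen-minus. It is the Unicode minus sign (U+2212), which float() does not accept, so it has to be replaced first.
Since pandas 2.0, str.replace treats the pattern as a literal string unless regex=True is passed.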
I would use something like https://regex101.com for debugging.
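Since the question asks for a function, here is a minimal sketch that wraps the steps discussed above. It assumes every price carries exactly two decimal places (so the digits can be divided by 100) and that any dash in a value marks it as negative. The name prices_to_float is made up for this example.

```python
import pandas as pd


def prices_to_float(prices: pd.Series) -> pd.Series:
    """Convert price strings such as '−$ 13234.79' or 'R$ 75.900,00' to floats.

    Assumption: exactly two decimal places in every value.
    """
    # Normalise the Unicode minus sign (U+2212) to an ASCII hyphen.
    text = prices.str.replace("\u2212", "-", regex=False)
    # Remember which values are negative before the sign is stripped.
    negative = text.str.contains("-", regex=False, na=False)
    # Keep digits only, then restore the two decimal places.
    digits = text.str.replace(r"\D", "", regex=True)
    values = pd.to_numeric(digits, errors="coerce") / 100
    return values.mask(negative, -values)


d = {'price': ['−$13.79', '−$ 13234.79', '$ 132834.79',
               'R$ 75.900,00', 'R$ 69.375,12', '- $ 2344.92']}
df = pd.DataFrame(data=d)
result = prices_to_float(df['price'])
print(result)
```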

Extract columns from string

I have a pandas df column containing the following strings:
0 Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1 Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2 Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')
I would like to extract data from strings and organize it as columns. As you can see, not all rows contain the same data and they are not in the same order. I only need some of the columns; this is the expected output:
Type conId symbol localSymbol
0 Future 462009617 CGB CGBZ21
1 Stock 80268543 IJPA IJPA
2 Stock 153454120 EMIM EMIM
I made some tests with str.extract but couldn't get what I want.
Any ideas on how to achieve it?
Thanks
You could try this using string methods. Assuming that the strings are stored in a column named 'main_col' (the final strip removes the quotes around the string values):
df["Type"] = df.main_col.str.split("(", expand = True)[0]
df["conId"] = df.main_col.str.partition("conId=")[2].str.partition(",")[0]
df["symbol"] = df.main_col.str.partition("symbol=")[2].str.partition(",")[0].str.strip("'")
df["localSymbol"] = df.main_col.str.partition("localSymbol=")[2].str.partition(",")[0].str.strip("'")
One solution using pandas.Series.str.extract (as you tried using it):
>>> df
col
0 Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1 Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2 Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')
>>> df.col.str.extract(r"^(?P<Type>Future|Stock).*conId=(?P<conId>\d+).*symbol='(?P<symbol>[A-Z]+)'.*localSymbol='(?P<localSymbol>[A-Z0-9]+)'")
Type conId symbol localSymbol
0 Future 462009617 CGB CGBZ21
1 Stock 80268543 IJPA IJPA
2 Stock 153454120 EMIM EMIM
In the above, I assume that:
Type takes the two values Future or Stock
conId consists of digits
symbol consists of capital alphabet letters
localSymbol consists of digits and capital alphabet letters
You may want to adapt the pattern to better fit your needs.
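The single pattern above expects conId, symbol and localSymbol to appear in that order. Since the question mentions that the order may vary, here is a minimal sketch that extracts each field with its own pattern instead. It assumes string values are wrapped in single quotes and that the type name is the word before the opening parenthesis.

```python
import pandas as pd

df = pd.DataFrame({"col": [
    "Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', "
    "multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')",
    "Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', "
    "localSymbol='IJPA', tradingClass='IJPA')",
    "Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', "
    "localSymbol='EMIM', tradingClass='EMIM')",
]})

# One independent pattern per field, so the field order does not matter.
out = pd.DataFrame({
    "Type": df["col"].str.extract(r"^(\w+)\(", expand=False),
    "conId": df["col"].str.extract(r"\bconId=(\d+)", expand=False),
    "symbol": df["col"].str.extract(r"\bsymbol='([^']*)'", expand=False),
    "localSymbol": df["col"].str.extract(r"\blocalSymbol='([^']*)'", expand=False),
})
print(out)
```

The extracted conId values are strings. Convert them with pd.to_numeric if numbers are needed.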

Pandas strip string from column

I have a csv file that looks as follows:
code. timestamp. message. name. Id. action
1. time. message. name. some text - id action
I would like to target the Id column and, if a value contains a -, strip everything from the beginning of the string up to and including the space after the -.
Basically, this is what I would like as output:
code. timestamp. message. name. Id. action
1. <time.> <message.> <name.> id <action>
Looking at some documentation, I found this solution:
df['id'] = df['id'].map(lambda x: x.lstrip('-').rstrip(' - '))
But this just strips characters from the left and the right of the string, which is not the result I want.
Can anyone help me understand how I can target the space after the - and remove everything before it, please?
Given:
Col1
0 some text - id
1 remove this text - 012345
Try:
import pandas as pd
mydf = pd.DataFrame({'Col1':['some text - id', 'remove this text - 012345']})
mydf['Col1'] = mydf['Col1'].str.extract(r'- (.*)')
print(mydf)
Outputs:
Col1
0 id
1 012345
You can first remove everything up to the last - and then strip the remaining spaces (since pandas 2.0, regex=True has to be passed explicitly):
df["id"].str.replace(r".*-", "", regex=True).str.strip()
You can use str.split() to split the strings on - and then get the right part of the string, as follows:
df['id'] = df['id'].str.split(r'- ').str[-1]
If there can be more than one occurrence of - in the string and you want to split only on the first occurrence, you can use:
df['id'] = df['id'].str.split(r'- ', n=1).str[-1]
Demo
df = pd.DataFrame({'id':['some text - id', 'another text - 2nd hypen - abc']})
id
0 some text - id
1 another text - 2nd hypen - abc
df['id'] = df['id'].str.split(r'- ').str[-1]
id
0 id
1 abc
df = pd.DataFrame({'id':['some text - id', 'another text - 2nd hypen - abc']})
id
0 some text - id
1 another text - 2nd hypen - abc
df['id'] = df['id'].str.split(r'- ', n=1).str[-1]
id
0 id
1 2nd hypen - abc
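One difference between the split and extract approaches matters for values that contain no "- " at all. A small sketch on made-up sample values (the second and third strings are not from the question): split leaves such values unchanged, while extract returns NaN for them.

```python
import pandas as pd

# Hypothetical sample: only the first value contains "- ".
s = pd.Series(["some text - id", "plain-id", "no dash here"])

# split keeps the whole string when the separator is missing.
via_split = s.str.split("- ", n=1).str[-1]

# extract yields NaN when the pattern does not match.
via_extract = s.str.extract(r"- (.*)", expand=False)

print(pd.DataFrame({"original": s, "split": via_split, "extract": via_extract}))
```

If values without a dash should be left untouched, as the question suggests, the split variant does that directly.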

extract certain words from column in a pandas df

I have a pandas df in which one column, message, holds strings like the one below:
df['message']
2020-09-23T22:38:34-04:00 mpp-xyz-010101-10-103.vvv0x.net patchpanel[1329]: RTP:a=end pp=10.10.10.10:9999 user=sip:.F02cf9f54b89a48e79772598007efc8c5.#user.com;tag=2021005845 lport=12270 raddr=11.00.111.212 rport=3004 d=5 arx=0.000 tx=0.000 fo=0.000 txf=0.000 bi=11004 bo=453 pi=122 pl=0 ps=0 rtt="" font=0 ua=funny-SDK-4.11.2.34441.fdc6567fW jc=10 no-rtp=0 cid=2164444 relog=0 vxdi=0 vxdo=0 vxdr=0\n
So I want to extract the raddr value from the data and join it back to the df.
I am doing it with the code below, on the assumption that it is at position 7 after the split:
df[['raddr']]=df['message'].str.split(' ', 100, expand=True)[[7]]
df['raddr']=df['raddr'].str[6:]
The issue is that in some rows it comes at position 8 and in others at 7, so for some rows this gives me rport and not raddr.
How can I extract it with a string search rather than with split?
Note: I also want a fast approach, as I am doing this on hundreds of thousands of records every minute.
You can use series.str.extract
df['raddr'] = df['message'].str.extract(r'raddr=([\d\.]*)') # not tested
The pattern has only one capturing group with the value after the equal sign. It will capture any combination of digits and periods until it finds something else (a blank space, letter, symbol, or end of line).
>>> import re
>>> s = '2020-09-23T22:38:34-04:00 mpp-xyz-010101-10-103.vvv0x.net patchpanel[1329]: RTP:a=end pp=10.10.10.10:9999 user=sip:.F02cf9f54b89a48e79772598007efc8c5.#user.com;tag=2021005845 lport=12270 raddr=11.00.111.212 rport=3004 d=5 arx=0.000 tx=0.000 fo=0.000 txf=0.000 bi=11004 bo=453 pi=122 pl=0 ps=0 rtt="" font=0 ua=funny-SDK-4.11.2.34441.fdc6567fW jc=10 no-rtp=0 cid=2164444 relog=0 vxdi=0 vxdo=0 vxdr=0\n'
>>> re.search(r'raddr=.*?\s',s).group()
'raddr=11.00.111.212 '
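To illustrate that str.extract does not depend on the field position, here is a minimal sketch on shortened, made-up log lines in which raddr sits at different positions or is missing. The messages are illustrative and not taken from the real data.

```python
import pandas as pd

# Shortened, hypothetical messages: raddr is at a different position in
# each of the first two rows and absent from the third.
df = pd.DataFrame({"message": [
    "RTP:a=end pp=10.10.10.10:9999 lport=12270 raddr=11.00.111.212 rport=3004 d=5",
    "RTP:a=end pp=10.10.10.10:9999 user=sip:x lport=12270 raddr=192.168.1.20 rport=3004",
    "RTP:a=end lport=12270 rport=3004",
]})

# \b keeps the match from starting in the middle of a longer key name.
df["raddr"] = df["message"].str.extract(r"\braddr=([\d.]+)", expand=False)
print(df["raddr"])
```

Rows without a raddr field end up as NaN rather than picking up a neighbouring field.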

ValueError: Columns must be same length as key (Split column in multiple columns using python)

The question has been asked a lot, however I'm still not close to the solution. I have a column which looks something like this
What I want to do is separate the country and language in different columns like
Country Language
Vietnam Vietnamese_display 1
Indonesia Tamil__1
India Tamil_Video_5
I'm using the following code to get it done. However, there are a lot of factors that need to be taken into account, and I'm not sure how to handle them:
df[['Country', 'Language']] = df['Line Item'].str.split('_\s+', n=1, expand=True)
How can I skip the first "_" to get my desired results? Thanks
You may use
df[['Country', 'Language']] = df['Line Item'].str.extract(r'^_*([^_]+)_(.+)')
Details
^ - start of string
_* - 0 or more underscores
([^_]+) - Capturing group 1: any one or more chars other than _
_ - a _ char
(.+) - Group 2: any one or more chars other than line break chars.
Pandas test:
df = pd.DataFrame({'Line Item': ['Vietnam_Vietnamese_display 1','Indonesia_Tamil__1','India_Tamil_Video_5']})
df[['Country', 'Language']] = df['Line Item'].str.extract(r'^_*([^_]+)_(.+)')
df
# Line Item Country Language
# 0 Vietnam_Vietnamese_display 1 Vietnam Vietnamese_display 1
# 1 Indonesia_Tamil__1 Indonesia Tamil__1
# 2 India_Tamil_Video_5 India Tamil_Video_5
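The same result can be sketched without a regex, assuming the country name never contains an underscore: drop any leading underscores, then split once on the first remaining one. The leading underscore in the second sample value is added here to illustrate the "skip the first _" case and is not in the original data.

```python
import pandas as pd

df = pd.DataFrame({"Line Item": [
    "Vietnam_Vietnamese_display 1",
    "_Indonesia_Tamil__1",   # hypothetical row with a leading underscore
    "India_Tamil_Video_5",
]})

# Remove leading underscores, then split only on the first underscore,
# so the result always has exactly two columns.
parts = df["Line Item"].str.lstrip("_").str.split("_", n=1, expand=True)
df["Country"] = parts[0]
df["Language"] = parts[1]
print(df)
```

Limiting the split with n=1 is what keeps the number of resulting columns equal to the number of target columns. A mismatch between the two is what raises "Columns must be same length as key".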
