Extract columns from string - python

I have a pandas df column containing the following strings:
0 Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1 Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2 Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')
I would like to extract data from strings and organize it as columns. As you can see, not all rows contain the same data and they are not in the same order. I only need some of the columns; this is the expected output:
Type conId symbol localSymbol
0 Future 462009617 CGB CGBZ21
1 Stock 80268543 IJPA IJPA
2 Stock 153454120 EMIM EMIM
I made some tests with str.extract but couldn't get what I want.
Any ideas on how to achieve it?
Thanks

You could try this using string methods. Assuming that the strings are stored in a column named 'main_col':
df["Type"] = df.main_col.str.split("(", expand = True)[0]
df["conId"] = df.main_col.str.partition("conId=")[2].str.partition(",")[0]
df["symbol"] = df.main_col.str.partition("symbol=")[2].str.partition(",")[0]
df["localSymbol"] = df.main_col.str.partition("localSymbol=")[2].str.partition(",")[0]

One solution using pandas.Series.str.extract (as you tried using it):
>>> df
col
0 Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1 Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2 Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')
>>> df.col.str.extract(r"^(?P<Type>Future|Stock).*conId=(?P<conId>\d+).*symbol='(?P<symbol>[A-Z]+)'.*localSymbol='(?P<localSymbol>[A-Z0-9]+)'")
Type conId symbol localSymbol
0 Future 462009617 CGB CGBZ21
1 Stock 80268543 IJPA IJPA
2 Stock 153454120 EMIM EMIM
In the above, I assume that:
Type takes one of the two values Future or Stock
conId consists of digits
symbol consists of uppercase letters
localSymbol consists of uppercase letters and digits
You may want to adapt the pattern to better fit your needs.
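For instance, if the symbols can contain characters other than uppercase letters and digits, a looser pattern that simply grabs whatever sits between the quotes also works. A sketch, keeping the column name col from above (the names pattern and out are just for illustration):
pattern = (r"^(?P<Type>\w+)\("              # class name before the opening parenthesis
           r".*conId=(?P<conId>\d+)"
           r".*symbol='(?P<symbol>[^']+)'"
           r".*localSymbol='(?P<localSymbol>[^']+)'")
out = df.col.str.extract(pattern)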

Related

extract certain words from column in a pandas df

I have a pandas df in which one column, message, contains strings with data like the below:
df['message']
2020-09-23T22:38:34-04:00 mpp-xyz-010101-10-103.vvv0x.net patchpanel[1329]: RTP:a=end pp=10.10.10.10:9999 user=sip:.F02cf9f54b89a48e79772598007efc8c5.#user.com;tag=2021005845 lport=12270 raddr=11.00.111.212 rport=3004 d=5 arx=0.000 tx=0.000 fo=0.000 txf=0.000 bi=11004 bo=453 pi=122 pl=0 ps=0 rtt="" font=0 ua=funny-SDK-4.11.2.34441.fdc6567fW jc=10 no-rtp=0 cid=2164444 relog=0 vxdi=0 vxdo=0 vxdr=0\n
I want to extract the raddr value from the data and join it back to the df.
I am doing it with the code below, thinking that it sits at position 7 after the split:
df[['raddr']]=df['message'].str.split(' ', 100, expand=True)[[7]]
df['raddr']=df['raddr'].str[6:]
The issue is that in some rows the value comes at position 8 and in some at 7, so for those rows I get rport instead of raddr.
How can I extract it with a string search instead of using split?
Note: I also want a fast approach, as I am doing this on hundreds of thousands of records every minute.
You can use pandas.Series.str.extract:
df['raddr'] = df['message'].str.extract(r'raddr=([\d\.]*)') # not tested
The pattern has only one capturing group with the value after the equal sign. It will capture any combination of digits and periods until it finds something else (a blank space, letter, symbol, or end of line).
>>> import re
>>> s = '2020-09-23T22:38:34-04:00 mpp-xyz-010101-10-103.vvv0x.net patchpanel[1329]: RTP:a=end pp=10.10.10.10:9999 user=sip:.F02cf9f54b89a48e79772598007efc8c5.#user.com;tag=2021005845 lport=12270 raddr=11.00.111.212 rport=3004 d=5 arx=0.000 tx=0.000 fo=0.000 txf=0.000 bi=11004 bo=453 pi=122 pl=0 ps=0 rtt="" font=0 ua=funny-SDK-4.11.2.34441.fdc6567fW jc=10 no-rtp=0 cid=2164444 relog=0 vxdi=0 vxdo=0 vxdr=0\n'
>>> re.search('raddr=.*?\s',s).group()
'raddr=11.00.111.212 '
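For a quick check of the pandas version on a one-row frame (a sketch reusing the string s above; expand=False makes extract return a Series):
>>> import pandas as pd
>>> df = pd.DataFrame({'message': [s]})
>>> df['raddr'] = df['message'].str.extract(r'raddr=([\d\.]*)', expand=False)
>>> df['raddr'].iloc[0]
'11.00.111.212'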

Match regex to its type in another dataframe

How can I match a data value with its regex type when the regex is in another dataframe? Here are the sample Data df and Regex df. Note that the two dfs have different shapes, as the Regex df is just a reference df and only contains unique values.
Data df
Country Type Data
MY      ABC  MY1234567890
IT      ABC  IT1234567890
PL      PQR  PL123456
MY      ABC  456792abc
IT      ABC  MY45889976
IT      ABC  IT56788897

Regex df
Country Type Regex
MY      ABC  ^MY[0-9]{10}
IT      ABC  ^IT[0-9]{10}
PL      PQR  ^PL
MY      DEF  ^\w{6,10}$
IT      XYZ  ^\w{6,10}$
For data that does not match its own regex, how can I find a match within its Country while scanning through all the types that country has? For example, the data 'MY45889976' does not follow the regex for its country (IT) and type (ABC), but it matches another type for its country, the (XYZ) type. So another column should be added giving the type it matches.
My desired output is something like this,
Country Type Data Data Quality Suggestion
0 MY ABC MY1234567890 1 0
1 IT ABC IT1234567890 1 0
2 IT ABC MY45889976 0 XYZ
3 IT ABC IT567888976 0 XYZ
4 PL PQR PL123456 1 0
5 MY XYZ 456792abc 0 DEF
This is what I have done to match the regex to get the data quality column (after concatenation),
df['Data Quality'] = df.apply(lambda r:re.match(r['Regex'],r['Data']) and 1 or 0, axis=1)
But I'm not sure how to move forward. Is there an easy way to do this without concatenation, and how can I find a matching regex by scanning all of a country's types while staying tied to that country only? Thanks
Refer to: Match column with its own regex in another column Python
Just apply a new column Suggestion; its logic follows your description.
def func(dfRow):
    # find the entry with the same Country and Type
    sameDF = regexDF.loc[(regexDF['Country'] == dfRow['Country']) & (regexDF['Type'] == dfRow['Type'])]
    if sameDF.size > 0 and re.match(sameDF.iloc[0]["Regex"], dfRow["Data"]):
        return 0
    # otherwise keep the same Country and look for a Type whose regex matches
    sameCountryDF = regexDF.loc[regexDF['Country'] == dfRow['Country']]
    for index, row in sameCountryDF.iterrows():
        if re.match(row["Regex"], dfRow["Data"]):
            return row["Type"]
    # returns None (NaN in the result) if nothing matches

df["Suggestion"] = df.apply(func, axis=1)
I suggest the following: merge by Country and do both operations in the same DataFrame (finding the regexes that match for the type in data_df and for the type in regex_df), as follows:
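The snippet below assumes the usual imports and that df and df_regex hold the Data df and Regex df from the question (a sketch of those imports):
import re
import numpy as np
import pandas as pd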
# First I merge only on country
new_df = pd.merge(df, df_regex, on="Country")
# Then I define an indicator for types that differ between the two DF
new_df["indicator"] = np.where(new_df["Type_x"] == new_df["Type_y"], "both", "right")
# I see if the regex matches Data for the `Type` in df
new_df['Data Quality'] = new_df.apply(
    lambda x: np.where(re.match(x['Regex'], x['Data']) and (x["indicator"] == "both"), 1, 0),
    axis=1)
# Then I fill Suggestion by looking if the regex matches data for the type in df_regex
new_df['Suggestion'] = new_df.apply(
    lambda x: np.where(re.match(x['Regex'], x['Data']) and (x["indicator"] == "right"), x["Type_y"], ""),
    axis=1)
# I remove lines where there is no suggestion and I just added lines from df_regex
new_df = new_df.loc[~((new_df["indicator"] == "right") & (new_df["Suggestion"] == "")), :]
new_df = new_df.sort_values(["Country", "Type_x", "Data"])
# After sorting I move Suggestion up one line
new_df["Suggestion"] = new_df["Suggestion"].shift(periods=-1)
new_df = new_df.loc[new_df["indicator"] == "both", :]
new_df = new_df.drop(columns=["indicator", "Type_y", "Regex"]).fillna("")
And you get this result:
Country Type_x Data Data Quality Suggestion
4 IT ABC IT1234567890 1
8 IT ABC IT56788897 0 XYZ
6 IT ABC MY45889976 0 XYZ
2 MY ABC 456792abc 0 DEF
0 MY ABC MY1234567890 1
10 PL PQR PL123456 1
The last line of your output seems to have the wrong Type, since it does not appear in data_df.
Using your sample data I find ABC for Data == "456792abc", together with your suggestion DEF.

How to convert a data frame column in python?

After reading from a large Excel file I have the following data:
Mode Fiscal Year/Period Amount
ABC 12.2001 10243.00
CAB 2.201 987.87
I need to convert the above data frame to the one below:
Mode Fiscal Year/Period Amount
ABC 012.2001 10243.00
CAB 002.2010 987.87
I need help converting the Fiscal Year/Period column.
It is always easier for us and you will get better help if you provide your attempts at the solution (your code).
Try this,
import pandas as pd
Recreating your data
data = {'mode':['abc', 'cab'], 'Fiscal Year/Period':[12.2001, 2.201]}
And put it in a dataframe,
data=pd.DataFrame(data)
Convert the column to a str,
data['Fiscal Year/Period']=data['Fiscal Year/Period'].astype(str)
And use zfill() to fill with zeros
data['Fiscal Year/Period'].apply(lambda x: x.zfill(8))
yields,
0 012.2001
1 0002.201
Name: Fiscal Year/Period, dtype: object
Note that zfill only pads on the left, so 2.201 becomes 0002.201 rather than the desired 002.2010; the fractional part still needs right-padding as well (see the next answer).
IIUC, you can just zfill and ljust
s = df['Fiscal_Year/Period'].str.split('.',expand=True)
s[0] = s[0].str.zfill(3)
s[1] = s[1].str.ljust(4,'0')
df['Year'] = s.agg('.'.join,axis=1)
print(df)
Mode Fiscal_Year/Period Amount Year
0 ABC 12.2001 10243.00 012.2001
1 CAB 2.201 987.87 002.2010
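If you prefer a single step, a callable replacement does the left and right padding in one pass (a sketch, assuming the column is already a string as in the split approach above):
df['Year'] = df['Fiscal_Year/Period'].str.replace(
    r'^(\d+)\.(\d+)$',
    lambda m: m.group(1).zfill(3) + '.' + m.group(2).ljust(4, '0'),
    regex=True)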

how do I separate two numbers in pandas

How do I separate two numbers if the length of the number is 18? What I exactly want to do is separate the mobile number (10 digits) and the landline number (8 digits) when they are joined (18 digits).
I have tried to extract the first 8 digits, but I don't know how to apply the condition, and I need to remove the first 8 digits if the condition is satisfied.
df['Landline'] = df['Number'].str[:8]
I have tried this, but I know it's wrong:
df['Landline'] = df['Number'].apply(lambda x : x.str[:8] if len(x)==18 )
For extracting the first 8 digits, use findall:
df['Number'].str.findall('^\d{8}')
Solution using an example
Here we use the Dummy Data made in the following section.
# separate landline and mobile numbers
phone_numbers = df.Numbers.str.findall('(^\d{8})*(\d{10})').tolist()
# store in a dict
d = dict((i, {'Landline': e[0][0], 'Mobile': e[0][1]}) for i, e in enumerate(phone_numbers))
# make a temporary dataframe
phone_df = pd.DataFrame(d).T
# update original dataframe
df['Landline'] = phone_df['Landline']
df['Mobile'] = phone_df['Mobile']
print(df)
Output:
Numbers Landline Mobile
0 123456780123456789 12345678 0123456789
1 0123456789 0123456789
Dummy Data
df = pd.DataFrame({'Numbers': ['123456780123456789', '0123456789', ]})
print(df)
Output:
Numbers
0 123456780123456789
1 0123456789
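The same split can also be done in one vectorized call; a sketch on the dummy data above, where the optional leading group leaves Landline empty for 10-digit numbers:
out = df['Numbers'].str.extract(r'^(?P<Landline>\d{8})?(?P<Mobile>\d{10})$')
df = df.join(out)   # missing landlines come back as NaN; fillna('') if you prefer blanks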
Looks like you need
df['Landline'] = df['Number'].apply(lambda x : x[:8] if len(x)==18 else x)

Python Pandas: Dataframe is not updating using string methods

I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains the column name 'about' which contains the rows of data I want to manipulate.
I've already used str methods to update the values, but the changes are not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. It is also possible to chain both operations together, since they work on the same column About; and because the values are converted to lowercase first, the regex can be changed to remove anything that is not a lowercase letter or a space:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '')
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '')
print (df)
About
0 aasd
1 sdd aa
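Note that on newer pandas releases Series.str.replace no longer treats the pattern as a regular expression by default, so the character class above would be taken literally; passing regex=True keeps the behaviour shown (a sketch of the same chain):
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)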
import pandas as pd
import numpy as np
columns = ['About']
data = ["ALPHA","OMEGA","ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '')
print(df)
OUTPUT:
About
0 alpha
1 omega
2 alphomga
Example Dataframe:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
Solution: another way, using a compiled regex with flags
>>> import re
>>> regex_pat = re.compile(r'[^a-z]+$', flags=re.IGNORECASE)  # assumed pattern, consistent with the explanation below
>>> df.About.str.lower().str.replace(regex_pat, '')
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation:
[^a-z] matches a single character not in the range between a (index 97) and z (index 122)
+ quantifier: matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the line
