pandas extracting substring from column - python

Suppose I have a df:
df = pd.DataFrame({'col': ['ABCXDEF', 'ABCYDEF']})
How can I extract the string that is surrounded by ABC & the first occurrence of DEF? Desired output:
col
0 X
1 Y
Note that I don't want a solution based on exact positions, like:
df.col.str[3:4]

(update: look for the first occurrence of 'DEF')
Use str.extract with a non-greedy regex:
df = pd.DataFrame({'col': ['ABCXDEF', 'ABCYDEFDEFDEF']})
print(df.col.str.extract(r"ABC(.*?)DEF"))
The result is:
0
0 X
1 Y
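The `?` makes the quantifier non-greedy, which is what stops the match at the first DEF. A quick sketch of the difference on the repeated-DEF sample:

```python
import pandas as pd

# Second value has three DEFs to show where each pattern stops.
df = pd.DataFrame({'col': ['ABCXDEF', 'ABCYDEFDEFDEF']})

non_greedy = df.col.str.extract(r"ABC(.*?)DEF")[0]  # stops at the first DEF
greedy = df.col.str.extract(r"ABC(.*)DEF")[0]       # runs to the last DEF

print(non_greedy.tolist())  # ['X', 'Y']
print(greedy.tolist())      # ['X', 'YDEFDEF']
```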

Related

Parse only specific characters from the string using python

I'm trying to split and parse characters from a column and write the parsed data into a different column.
I was parsing on '_' in the column data, and it worked fine as long as the number of '_' characters in the string was fixed at 2.
Input Data:
Col1
U_a65839_Jan87Apr88
U_b98652_Feb88Apr88_(2).jpg.pdf
V_C56478_mar89Apr89
Q_d15634_Apr90Apr91
Q_d15634_Apr90Apr91_(3).jpeg.pdf
S_e15336_may91Apr93
NaN
Expected Output:
col2
Jan87Apr88
Feb88Apr88
mar89Apr89
Apr90Apr91
Apr90Apr91
may91Apr93
Code I have been trying:
df = pd.read_excel(open(r'Dats.xlsx', 'rb'), sheet_name='Sheet1')
df['Col2'] = df.Col1.str.replace(
'.*_', '', regex=True
)
print(df['Col2'])
I think you want this:
col2 = df.Col1.str.split("_", expand=True)[2]
output:
0 Jan87Apr88
1 Feb88Apr88
2 mar89Apr89
3 Apr90Apr91
4 Apr90Apr91
5 may91Apr93
6 NaN
(you can dropna if you don't want the last row)
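A self-contained sketch of that approach, using inline sample data in place of the Excel file:

```python
import pandas as pd

# Two representative rows plus a missing value, standing in for the spreadsheet.
df = pd.DataFrame({'Col1': ['U_a65839_Jan87Apr88',
                            'U_b98652_Feb88Apr88_(2).jpg.pdf',
                            None]})

# The third underscore-separated field holds the date range.
col2 = df.Col1.str.split("_", expand=True)[2]
print(col2.dropna().tolist())  # ['Jan87Apr88', 'Feb88Apr88']
```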
Use str.extract here:
import re

df["col2"] = df["Col1"].str.extract(r'((?:[a-z]{3}\d{2}){2})', flags=re.IGNORECASE)
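A runnable sketch of that pattern on a small sample:

```python
import re
import pandas as pd

df = pd.DataFrame({'Col1': ['V_C56478_mar89Apr89',
                            'Q_d15634_Apr90Apr91_(3).jpeg.pdf']})

# Two month-year pairs back to back, matched case-insensitively.
col2 = df['Col1'].str.extract(r'((?:[a-z]{3}\d{2}){2})', flags=re.IGNORECASE)[0]
print(col2.tolist())  # ['mar89Apr89', 'Apr90Apr91']
```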
Based on your question, the pandas DataFrame apply can be a good solution:
First, clean the DataFrame by replacing NaN with an empty string '':
df = pd.DataFrame(data=['U_a65839_Jan87Apr88', 'U_b98652_Feb88Apr88_(2).jpg.pdf', 'V_C56478_mar89Apr89', 'Q_d15634_Apr90Apr91', 'Q_d15634_Apr90Apr91_(3).jpeg.pdf', 'S_e15336_may91Apr93', None], columns=['Col1'])
df = df.fillna('')
Col1
0 U_a65839_Jan87Apr88
1 U_b98652_Feb88Apr88_(2).jpg.pdf
2 V_C56478_mar89Apr89
3 Q_d15634_Apr90Apr91
4 Q_d15634_Apr90Apr91_(3).jpeg.pdf
5 S_e15336_may91Apr93
6
Next, define a function to extract the required string with regex
import re

def fun(s):
    m = re.search(r'\w{3}\d{2}\w{3}\d{2}', s)
    if m:
        return m.group(0)
    else:
        return ''
Then apply the function to the DataFrame:
df['Col2'] = df['Col1'].apply(fun)
Col1 Col2
0 U_a65839_Jan87Apr88 Jan87Apr88
1 U_b98652_Feb88Apr88_(2).jpg.pdf Feb88Apr88
2 V_C56478_mar89Apr89 mar89Apr89
3 Q_d15634_Apr90Apr91 Apr90Apr91
4 Q_d15634_Apr90Apr91_(3).jpeg.pdf Apr90Apr91
5 S_e15336_may91Apr93 may91Apr93
6
Hope the above helps.
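The same helper logic can also be written without apply, using the vectorized .str.extract plus fillna to keep the empty-string convention. A sketch on a reduced sample:

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['U_a65839_Jan87Apr88', None]})

# Vectorized equivalent of the fun()/apply pair: extract the pattern,
# then replace the NaN from the missing row with ''.
df['Col2'] = df['Col1'].str.extract(r'(\w{3}\d{2}\w{3}\d{2})')[0].fillna('')
print(df['Col2'].tolist())  # ['Jan87Apr88', '']
```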

Changing values in dataframe based on cell and column name

I have a dataframe
df = pd.DataFrame([[0, 1, 2]], columns=['3m3a', '1z6n', '11p66d'])
Now I would like to apply 2 * value * (the trailing number in the column name), e.g. for the last column: 2 * 2 * 66.
df.apply(lambda x: 2*x) covers step 1.
Step 2 is the hardest part.
I can build a new frame like df2 = df.stack().reset_index().apply(lambda x: x[re.search('[a-zA-Z]+', x).end():]) and then multiply by 2.
What's a more pythonic way?
For DataFrame:
3m3a 1z6n 11p66d
0 0 1 2
You can use .columns.str.extract and then DataFrame.multiply:
vals = df.columns.str.extract(r"(\d+)[a-z]*?$").T.astype(int)
df = df.multiply(2 * vals.values, axis=1)
print(df)
Prints:
3m3a 1z6n 11p66d
0 0 12 264
Late to the party, and having found almost the same answer, but using negative look-behind regex:
newdf = df.multiply(
2 * df.columns.str.extract(r'.*(?<!\d)(\d+)\D*').astype(int).values.ravel(),
axis=1)
>>> newdf
3m3a 1z6n 11p66d
0 0 12 264
Thank you, both work.
What if I would like to split the column names into 2 parts, one up to and including the first letter, and the second the part after?
df.columns.str.split(r"(\d+\D+)", n=1, expand=True)
works but gives me 3 parts, with the first blank.
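For the two-part split in the follow-up (everything up to and including the first letter, then the rest), str.extract with two groups avoids the blank leading column. A sketch, where the pattern is my guess at the intended split:

```python
import pandas as pd

cols = pd.Index(['3m3a', '1z6n', '11p66d'])

# Group 1: leading digits plus the first letter; group 2: the remainder.
parts = cols.str.extract(r'^(\d+\D)(.*)$')
print(parts[0].tolist())  # ['3m', '1z', '11p']
print(parts[1].tolist())  # ['3a', '6n', '66d']
```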

How can I use split() in a string when broadcasting a dataframe's column?

Take the following dataframe:
df = pd.DataFrame({'col_1':[0, 1], 'col_2':['here 123', 'here 456']})
Result:
col_1 col_2
0 0 here 123
1 1 here 456
I need to create a 3rd column (broadcasting), using a condition on col_1, and splitting the string on col_2. This is ok to do:
df['col_3'] = float('NaN')
df.loc[df['col_1'] == 1, ['col_3']] = df['col_2'].str.slice(5, 8)
Result:
col_1 col_2 col_3
0 0 here 123 NaN
1 1 here 456 456
But I need to specify dynamic indexes to split the string on col_2, instead of (5, 8).
When I try to run the following code it does not work, because df['col_2'] is treated as a Series:
df.loc[df['col_1'] == 1, ['col_3']] = df['col_2'].split(' ')[0]
I'm spending a lot of time trying to solve this without iterating over the dataframe.
This one-liner does the trick:
df['col_3'] = [y.split(' ')[1] if x == 1 else float('nan') for x, y in df[['col_1', 'col_2']].values]
Use np.where to pick the split part based on the condition:
import numpy as np

df['col_3'] = np.where(df.col_1 == 1,
                       df['col_2'].str.split().str[-1],
                       np.nan)
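The failing line from the question only needs split reached through the .str accessor (plain split belongs to a single string, not a Series); Series.where then handles the NaN fill in one step. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'col_1': [0, 1], 'col_2': ['here 123', 'here 456']})

# .str.split vectorizes the split over the whole column;
# .where keeps values where the condition holds and fills NaN elsewhere.
df['col_3'] = df['col_2'].str.split(' ').str[1].where(df['col_1'] == 1)
print(df['col_3'].tolist())  # [nan, '456']
```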

Filter rows in df based on number of string occurances

I’d like to filter rows in a df based on the condition that the row contains mentions of 2+ strings in a long list of strings. I’m having trouble specifying the number of occurrences. Here is my code so far:
brands = ["a", "b", "c"]
df[df.Column.str.contains('|'.join(brands), flags=re.IGNORECASE, regex=True, na=False)]
Maybe use count instead of contains:
import re

df = pd.DataFrame({'Column': ['ab', 'ABx', 'axy']})
df[df.Column.str.count('|'.join(brands), flags=re.IGNORECASE) >= 2]
Output:
Column
0 ab
1 ABx
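Note that str.count tallies total pattern hits, so a row mentioning one brand twice also passes. If the requirement is two or more distinct brands, one contains mask per brand can be summed instead. A sketch with a hypothetical extra row:

```python
import pandas as pd

brands = ['a', 'b', 'c']
# 'aa' mentions only one distinct brand, even though it has two hits.
df = pd.DataFrame({'Column': ['ab', 'ABx', 'aa', 'xyz']})

# One boolean mask per brand; summing counts how many distinct brands appear.
hits = sum(df.Column.str.contains(b, case=False, na=False) for b in brands)
print(df[hits >= 2].Column.tolist())  # ['ab', 'ABx']
```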
Here's an example of the check. Is this what you are trying to do?
import pandas as pd
df = pd.DataFrame({'Test':['a','b','cat','dog']})
print (df)
df['check']= df['Test'].isin(['a','b','c'])
print (df)
This will result in:
Test check
0 a True
1 b True
2 cat False
3 dog False

python pandas filter by part of rows existing in a list

I have this DataFrame:
df = pd.DataFrame({'A': ['data1|context1', 'data2|context2', 'data3|context3', 'data4|context4']})
resulting:
A
0 data1|context1
1 data2|context2
2 data3|context3
3 data4|context4
I also have this list:
items = ['data1', 'data3']
I want to get the DataFrame rows which do not have their left part of | in the list. How do I filter only by the left part of each row? I only know how to filter by the entire row, not by part of it.
This should be the result:
A
0 data2|context2
1 data4|context4
Edit: would obtaining this result with pandas be more efficient than getting the values with a list comprehension?
You could use a boolean mask based on match:
import pandas as pd
items = ['data1', 'data3']
df = pd.DataFrame({'A': ['data1|context1', 'data2|context2', 'data3|context3', 'data4|context4']})
mask = df.A.str.match('^(?!{})'.format('|'.join(items)))
result = df[mask]
print(result)
Output
A
1 data2|context2
3 data4|context4
The expression '^(?!{})'.format('|'.join(items)) builds the pattern ^(?!data1|data3), which means: does not start with either 'data1' or 'data3'. If you prefer a one-liner you can do:
result = df.loc[df.A.str.match('^(?!{})'.format('|'.join(items)))]
Or use split with apply:
df.loc[df['A'].str.split('|').apply(lambda x: x[0] not in items )]
Output
A
1 data2|context2
3 data4|context4
This can be done using extract:
print(df.loc[~df.A.str.extract(r'([^|]+)').isin(items)[0]].reset_index(drop=True))
Output:
A
0 data2|context2
1 data4|context4
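On the efficiency edit: a non-regex variant keeps everything in pandas and reads clearly; whether it beats a list comprehension depends on the data size, so it is worth timing on real data. A sketch:

```python
import pandas as pd

items = ['data1', 'data3']
df = pd.DataFrame({'A': ['data1|context1', 'data2|context2',
                         'data3|context3', 'data4|context4']})

# Take the part before the first '|' and keep rows whose prefix is not listed.
result = df[~df.A.str.split('|').str[0].isin(items)]
print(result.A.tolist())  # ['data2|context2', 'data4|context4']
```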
