I want to split the column by some specific words and keep the delimiter in the same time. I tried to split the column with str.split but the result isn't I want.
example data(test.csv):
a
abc123and321abcor213cba
abc321or123cbaand321cba
my code:
import pandas as pd
df = pd.read_csv('test.csv')
df[['b','c']] = df['a'].str.split("and",1,expand=True)
df[['c','d']] = df['c'].str.split("or",1,expand=True)
print(df)
my result:
a b c d
0 abc123and321abcor213cba abc123 321abc 213cba
1 abc321or123cbaand321cba abc321or123cba 321cba None
Desired result:
a b c d
0 abc123and321abcor213cba abc123and 321abcor 213cba
1 abc321or123cbaand321cba abc321or 123cbaand 321cba
How can I do this?
Borrowing from Tim's answer using a lookbehind Regex to split on and or or, without using up the seperating string in the split:
d = {'a': ["abc123and321abcor213cba", "abc321or123cbaand321cba"]}
df = pandas.DataFrame(data=d)
df[["b", "c", "d"]] = df['a'].str.split(r'(?<=and)|(?<=or)', expand=True)
Output:
a b c d
0 abc123and321abcor213cba abc123and 321abcor 213cba
1 abc321or123cbaand321cba abc321or 123cbaand 321cba
If you have issues with split check that you don't have a too old pandas version.
You could also use str.extractall and unstack. I'll also recommend to usejoin to add the columns if you don't know in advance the number of matches/columns?
df = pd.DataFrame({'a': ["abc123and321abcor213cba", "abc321or123cbaand321cba"]})
df.join(df['a'].str.extractall(r'(.*?(?:and|or)|.+$)')[0].unstack('match'))
Output:
a 0 1 2
0 abc123and321abcor213cba abc123and 321abcor 213cba
1 abc321or123cbaand321cba abc321or 123cbaand 321cba
Try splitting on the lookbehind (?<=and|or):
df[['b', 'c', 'd']] = df['a'].str.split(r'(?<=and)|(?<=or)', 1, expand=True)
Related
I have a data-frame and one of its columns are a string which separated with dash. I want to get the part before the dash. Could you help me with that?
import pandas as pd
df = pd.DataFrame()
df['a'] = [1, 2, 3, 4, 5]
df['b'] = ['C-C02','R-C05','R-C01','C-C06', 'RC-C06']
The desire output is:
You could use str.replace to remove the - and all characters after it:
df['b'] = df['b'].str.replace(r'-.*$', '', regex=True)
Output:
a b
0 1 C
1 2 R
2 3 R
3 4 C
4 5 RC
You want to split each string on the '-' character and keep the part before it:
df['c'] = [s.split('-')[0] for s in df['b']]
I want to remove the all string after last underscore from the dataframe. If I my data in dataframe looks like.
AA_XX,
AAA_BB_XX,
AA_BB_XYX,
AA_A_B_YXX
I would like to get this result
AA,
AAA_BB,
AA_BB,
AA_A_B
You can do this simply using Series.str.split and Series.str.join:
In [2381]: df
Out[2381]:
col1
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
In [2386]: df['col1'] = df['col1'].str.split('_').str[:-1].str.join('_')
In [2387]: df
Out[2387]:
col1
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
Explaination:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})
Creates
col
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
Use apply in order to loop through the column you want to edit.
I broke the string at _ and then joined all parts leaving the last part at _
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
print(df)
Results:
col
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
If your dataset contains values like AA (values without underscore).
Change the lambda like this
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]) if len(r.split('_')) > 1 else r)
print(df)
Here is another way of going about it.
import pandas as pd
data = {'s': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']}
df = pd.DataFrame(data)
def cond1(s):
temp_s = s.split('_')
temp_len = len(temp_s)
if len(temp_s) == 1:
return temp_s
else:
return temp_s[:len(temp_s)-1]
df['result'] = df['s'].apply(cond1)
I am trying to replace my column names that have quotations and simply remove the quotations but when I try this:
for x in df.columns:
x = x.replace('"', '')
print(x)
Nothing happens and the quotations are still there.
I would do something like this
cols = [column_name.replace('"','') for column_name in df.columns]
df.columns = cols
CODE
import pandas as pd
df=pd.DataFrame({"a":[1,2],'"b"':[3,4]})
print('BEFORE')
print(df)
cols = [column_name.replace('"','') for column_name in df.columns]
df.columns = cols
print('AFTER')
print(df)
OUTPUT
BEFORE
a "b"
0 1 3
1 2 4
AFTER
a b
0 1 3
1 2 4
you can remove it by writing the following code:
col=[]
for x in df.columns:
x = x.replace('"', '')
col.append(x)
df.columns=col
To know more about column renaming: Check this Renaming columns in pandas
One canonical solution to this problem is using pandas str.replace on the header directly (this is "vectorized"):
df = pd.DataFrame({"a": [1, 2], '"b"': [3, 4]})
df.columns = df.columns.str.replace('"', '')
df
a b
0 1 3
1 2 4
I have a dataframe with column a. I need to get data after second _.
a
0 abc_def12_0520_123
1 def_ghij123_0120_456
raw_data = {'a': ['abc_def12_0520_123', 'def_ghij123_0120_456']}
df = pd.DataFrame(raw_data, columns = ['a'])
Output:
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
What I have tried:
df['b'] = df.number.str.replace('\D+', '')
I tried removing alphabets first, But its getting complex. Any suggestions
Here is how:
df['b'] = ['_'.join(s.split('_')[2:]) for s in df['a']]
print(df)
Output:
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
Explanation:
lst = ['_'.join(s.split('_')[2:]) for s in df['a']]
is the equivalent of:
lst = []
for s in df['a']:
a = s.split('_')[2:] # List all strings in list of substrings splitted '_' besides the first 2
lst.append('_'.join(a))
Try:
df['b'] = df['a'].str.split('_',2).str[-1]
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
I have a column containing values. I want to split it based on a regex. If the regex matches, the original value will be replaced with the left-side of the split. A new column will contain the right-side of a split.
Below is some sample code. I feel I am close but it isn't quite working.
import pandas as pd
import re
df = pd.DataFrame({ 'A' : ["test123","foo"]})
// Regex example to split it if it ends in numbers
r = r"^(.+?)(\d*)$"
df['A'], df['B'] = zip(*df['A'].apply(lambda x: x.split(r, 1)))
print(df)
In the example above I would expect the following output
A B
0 test 123
1 foo
I am fairly new to Python and assumed this would be the way to go. However, it appears that I haven't quite hit the mark. Is anyone able to help me correct this example?
Just base on your own regex
df.A.str.split(r,expand=True).replace('',np.nan).dropna(thresh=1,axis=1).fillna('')
Out[158]:
1 2
0 test 123
1 foo
df[['A','B']]=df.A.str.split(r,expand=True).replace('',np.nan).dropna(thresh=1,axis=1).fillna('')
df
Out[160]:
A B
0 test 123
1 foo
Your regex is working just fine, use it with str.extract
df = pd.DataFrame({ 'A' : ["test123","foo", "12test3"]})
df[['A', 'B']] = df['A'].str.extract("^(.+?)(\d*)$", expand = True)
A B
0 test 123
1 foo
2 12test 3
def bar(x):
els = re.findall(r'^(.+?)(\d*)$', x)[0]
if len(els):
return els
else:
return x, None
def foo():
df = pd.DataFrame({'A': ["test123", "foo"]})
df['A'], df['B'] = zip(*df['A'].apply(bar))
print(df)
result:
A B
0 test 123
1 foo