split the string in dataframe in python - python

I have a DataFrame, and one of its columns contains strings separated by a dash. I want to get the part before the dash. Could you help me with that?
import pandas as pd
df = pd.DataFrame()
df['a'] = [1, 2, 3, 4, 5]
df['b'] = ['C-C02','R-C05','R-C01','C-C06', 'RC-C06']
The desired output is:

You could use str.replace to remove the - and all characters after it:
df['b'] = df['b'].str.replace(r'-.*$', '', regex=True)
Output:
a b
0 1 C
1 2 R
2 3 R
3 4 C
4 5 RC

You want to split each string on the '-' character and keep the part before it:
df['c'] = [s.split('-')[0] for s in df['b']]
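The same idea can also be written with pandas' vectorized string methods instead of a list comprehension. The following is a minimal sketch, assuming every value in column b is a string; the column name prefix is hypothetical and only used for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': ['C-C02', 'R-C05', 'R-C01', 'C-C06', 'RC-C06']})

# Split on the first dash only (n=1) and keep the piece before it.
# 'prefix' is a hypothetical column name chosen for this sketch.
df['prefix'] = df['b'].str.split('-', n=1).str[0]

print(df['prefix'].tolist())
```

A value without a dash is returned unchanged, because splitting it yields a single piece.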

Related

pandas split column by some specific words and keep this delimiter

I want to split the column on some specific words and keep the delimiter at the same time. I tried to split the column with str.split, but the result isn't what I want.
example data(test.csv):
a
abc123and321abcor213cba
abc321or123cbaand321cba
my code:
import pandas as pd
df = pd.read_csv('test.csv')
df[['b','c']] = df['a'].str.split("and", n=1, expand=True)
df[['c','d']] = df['c'].str.split("or", n=1, expand=True)
print(df)
my result:
a b c d
0 abc123and321abcor213cba abc123 321abc 213cba
1 abc321or123cbaand321cba abc321or123cba 321cba None
Desired result:
a b c d
0 abc123and321abcor213cba abc123and 321abcor 213cba
1 abc321or123cbaand321cba abc321or 123cbaand 321cba
How can I do this?
Borrowing from Tim's answer, use a lookbehind regex to split on "and" or "or" without consuming the separating string in the split:
d = {'a': ["abc123and321abcor213cba", "abc321or123cbaand321cba"]}
df = pd.DataFrame(data=d)
df[["b", "c", "d"]] = df['a'].str.split(r'(?<=and)|(?<=or)', expand=True)
Output:
a b c d
0 abc123and321abcor213cba abc123and 321abcor 213cba
1 abc321or123cbaand321cba abc321or 123cbaand 321cba
If you have issues with split, check that your pandas version is not too old.
You could also use str.extractall and unstack. I also recommend using join to add the columns if you don't know the number of matches/columns in advance:
df = pd.DataFrame({'a': ["abc123and321abcor213cba", "abc321or123cbaand321cba"]})
df.join(df['a'].str.extractall(r'(.*?(?:and|or)|.+$)')[0].unstack('match'))
Output:
a 0 1 2
0 abc123and321abcor213cba abc123and 321abcor 213cba
1 abc321or123cbaand321cba abc321or 123cbaand 321cba
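To see what the extractall pattern matches, it can help to run it on a single string with the standard re module. This is only an illustrative sketch; the variable names are assumptions, not part of the answer above.

```python
import re

# Either: the shortest run of characters ending in "and" or "or",
# or: whatever non-empty text remains up to the end of the string.
pattern = re.compile(r'(.*?(?:and|or)|.+$)')

samples = ["abc123and321abcor213cba", "abc321or123cbaand321cba"]
pieces = [pattern.findall(s) for s in samples]

for s, p in zip(samples, pieces):
    print(s, '->', p)
```

Each string yields three pieces, which is why unstack produces the columns 0, 1 and 2.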
Try splitting on the alternated lookbehinds (?<=and)|(?<=or):
df[['b', 'c', 'd']] = df['a'].str.split(r'(?<=and)|(?<=or)', expand=True)
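Python's re module only accepts fixed-width lookbehinds, which is why the pattern is written as two lookbehinds joined by alternation rather than one lookbehind containing both words. The sketch below checks this on a plain string; it assumes Python 3.7 or later, where re.split can split on empty matches.

```python
import re

text = "abc123and321abcor213cba"

# Two fixed-width lookbehinds joined by alternation: split at every
# position that is immediately preceded by "and" or by "or".
parts = re.split(r'(?<=and)|(?<=or)', text)
print(parts)

# A single lookbehind whose alternatives differ in length is rejected.
try:
    re.compile(r'(?<=and|or)')
    single_lookbehind_ok = True
except re.error:
    single_lookbehind_ok = False
print(single_lookbehind_ok)
```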

Data Preprocessing in Python using Pandas

I am trying to preprocess one of the columns in my DataFrame. The issue is that I have [[content1], [content2], [content3]] in the relations column. I want to remove the brackets.
I have tried the following:
df['value'] = df['value'].str[0]
The output that I get is:
[content 1]
My DataFrame:
print(df)
id value
1 [[str1],[str2],[str3]]
2 [[str4],[str5]]
3 [[str1]]
4 [[str8]]
5 [[str9]]
6 [[str4]]
The expected output should look like this:
id value
1 str1,str2,str3
2 str4,str5
3 str1
4 str8
5 str9
6 str4
It looks like you have lists of lists. You can try to unnest and join:
df['value'] = df['value'].apply(lambda x: ','.join([e for l in x for e in l]))
Or:
from itertools import chain
df['value'] = df['value'].apply(lambda x: ','.join(chain.from_iterable(x)))
NB. If you get an error, please provide it and the type of the column (df.dtypes)
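As a quick check of the unnest-and-join approach, here is a minimal sketch that assumes the column really holds Python lists of lists (not strings). The sample values are made up to mirror the question.

```python
from itertools import chain

import pandas as pd

# Assumed input: each cell is a list of one-element lists.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'value': [[['str1'], ['str2'], ['str3']],
              [['str4'], ['str5']],
              [['str1']]],
})

# Flatten one level of nesting, then join the pieces with commas.
df['value'] = df['value'].apply(lambda rows: ','.join(chain.from_iterable(rows)))
print(df)
```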
As far as I can see, your data looks like this sample:
Sample Data:
df = pd.DataFrame({'id':[1,2,3,4,5,6], 'value':['[[str1],[str2],[str3]]', '[[str4],[str5]]', '[[str1]]', '[[str8]]', '[[str9]]', '[[str4]]']})
print(df)
id value
0 1 [[str1],[str2],[str3]]
1 2 [[str4],[str5]]
2 3 [[str1]]
3 4 [[str8]]
4 5 [[str9]]
5 6 [[str4]]
Result:
df['value'] = df['value'].astype(str).str.replace('[', '', regex=False).str.replace(']', '', regex=False)
print(df)
id value
0 1 str1,str2,str3
1 2 str4,str5
2 3 str1
3 4 str8
4 5 str9
5 6 str4
Note: the error AttributeError: Can only use .str accessor with string values means the column is not being treated as str. Cast it to str with astype(str) first, and then do the replace operation.
You can use Python's regex package, re.
Here is the solution:
import pandas as pd
import re
# make the test data
data = [
    [1, '[[str1],[str2],[str3]]'],
    [2, '[[str4],[str5]]'],
    [3, '[[str1]]'],
    [4, '[[str8]]'],
    [5, '[[str9]]'],
    [6, '[[str4]]']
]
# convert the data to a DataFrame
df = pd.DataFrame(data, columns=['id', 'value'])
print(df)
# remove '[' and ']' from the 'value' column
df['value'] = df.apply(lambda x: re.sub(r"[\[\]]", "", x['value']), axis=1)
print(df)

Replace column names with quotations with no quotations

I am trying to remove the quotation marks from my column names, but when I try this:
for x in df.columns:
    x = x.replace('"', '')
    print(x)
Nothing happens and the quotation marks are still there.
I would do something like this:
cols = [column_name.replace('"','') for column_name in df.columns]
df.columns = cols
CODE
import pandas as pd
df=pd.DataFrame({"a":[1,2],'"b"':[3,4]})
print('BEFORE')
print(df)
cols = [column_name.replace('"','') for column_name in df.columns]
df.columns = cols
print('AFTER')
print(df)
OUTPUT
BEFORE
a "b"
0 1 3
1 2 4
AFTER
a b
0 1 3
1 2 4
You can remove them with the following code:
col = []
for x in df.columns:
    x = x.replace('"', '')
    col.append(x)
df.columns = col
To learn more about column renaming, see "Renaming columns in pandas".
One canonical solution to this problem is using pandas str.replace on the header directly (this is "vectorized"):
df = pd.DataFrame({"a": [1, 2], '"b"': [3, 4]})
df.columns = df.columns.str.replace('"', '')
df
a b
0 1 3
1 2 4
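The loop in the question has no effect because assigning to the loop variable only rebinds a local name; nothing is ever written back to df.columns. The sketch below shows this, together with rename and a function as another way to clean the labels. The variable names unchanged and renamed are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], '"b"': [3, 4]})

# Rebinding the loop variable creates a new string each time;
# df.columns itself is never assigned to, so it stays unchanged.
for name in df.columns:
    name = name.replace('"', '')
unchanged = list(df.columns)

# rename() accepts a function that is applied to every column label
# and returns a new DataFrame with the cleaned names.
renamed = df.rename(columns=lambda name: name.replace('"', ''))

print(unchanged)
print(list(renamed.columns))
```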

Python DataFrame : Split data in rows based on custom value?

I have a DataFrame with a column a. I need to get the data after the second underscore (_).
a
0 abc_def12_0520_123
1 def_ghij123_0120_456
raw_data = {'a': ['abc_def12_0520_123', 'def_ghij123_0120_456']}
df = pd.DataFrame(raw_data, columns = ['a'])
Desired output:
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
What I have tried:
df['b'] = df.number.str.replace('\D+', '')
I tried removing the alphabetic characters first, but it's getting complex. Any suggestions?
Here is how:
df['b'] = ['_'.join(s.split('_')[2:]) for s in df['a']]
print(df)
Output:
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
Explanation:
lst = ['_'.join(s.split('_')[2:]) for s in df['a']]
is equivalent to:
lst = []
for s in df['a']:
    a = s.split('_')[2:]  # all substrings from splitting on '_', except the first 2
    lst.append('_'.join(a))
Try:
df['b'] = df['a'].str.split('_', n=2).str[-1]
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
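A regular expression is another way to express "everything after the second underscore". This is a hedged sketch using str.extract; the pattern is my own illustration rather than something taken from the answers above.

```python
import pandas as pd

df = pd.DataFrame({'a': ['abc_def12_0520_123', 'def_ghij123_0120_456']})

# (?:[^_]*_){2} skips two "text followed by an underscore" groups;
# (.*) then captures everything that is left.
df['b'] = df['a'].str.extract(r'^(?:[^_]*_){2}(.*)$', expand=False)
print(df)
```

With this pattern, a value containing fewer than two underscores does not match and ends up as NaN.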

Pandas: how to find and concatenate values

I'm trying to replace and add some values in a pandas DataFrame object. I have the following code:
import pandas as pd
df = pd.DataFrame({'A': ["va-lue", "value-%", "value"], 'B': [4, 5, 6]})
print(df)
df['A'] = df['A'].str.replace('%', '_0')
print(df)
df['A'] = df['A'].str.replace('-', '')
print(df)
# almost there?
df.A[df['A'].str.contains('-')] + "_0"
How can I find the cell values in column A that contain a '-' sign, replace the '-' with '' and add a trailing '_0' to these values? The resulting data set should look like this:
A B
0 value_0 4
1 value_0 5
2 value 6
You can first keep track of the rows whose A needs the trailing string appended, and then perform the operations in two steps:
mask = df['A'].str.contains('-')
df['A'] = df['A'].str.replace('-|%', '', regex=True)
df.loc[mask, 'A'] += '_0'
print(df)
Output:
A B
0 value_0 4
1 value_0 5
2 value 6
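The mask-then-update idea can also be written without in-place assignment by using Series.where. This is a minimal sketch under the same assumptions as the answer, starting from the original, unmodified column; the names has_dash and cleaned are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['va-lue', 'value-%', 'value'], 'B': [4, 5, 6]})

# Remember which rows contained a dash before anything is removed.
has_dash = df['A'].str.contains('-', regex=False)

# Strip both '-' and '%' in one pass.
cleaned = df['A'].str.replace(r'[-%]', '', regex=True)

# Keep the cleaned value where there was no dash; otherwise append '_0'.
df['A'] = cleaned.where(~has_dash, cleaned + '_0')
print(df)
```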
