I have a column with the headlines of articles. The headlines are like this(in Greek):
[\n, [Μητσοτάκης: Έχει μεγάλη σημασία οι φωτισ..
How can I remove this character: [\n, ?
I have tried this but nothing happened:
df['Title'].replace('\n', '', regex=True)
.replace() does not change the dataframe by default, it returns a new dataframe. Use the inplace pararameter.
>>> import pandas
>>> df = pandas.DataFrame([{"x": "a\n"}, {"x": "b\n"}, {"x": "c\n"}])
>>> df['x'].replace('\n', '', regex=True) # does not change df
0 a
1 b
2 c
Name: x, dtype: object
>>> df # df is unchanged
x
0 a\n
1 b\n
2 c\n
>>> df['x'].replace('\n', '', regex=True, inplace=True)
>>> df # df is changed
x
0 a
1 b
2 c
You're looking for
df['Title'].str.replace('\n', '')
Also remember that this replacement doesn't happen in-place. To change the original dataframe, you're going to have to do
df['Title'] = df['Title'].str.replace('\n', '')
df.str provides vectorized string functions to operate on each value in the column. df.str.replace('\n', '') runs the str.replace() function on each element of df.
df.replace() replaces entire values in the column with the given replacement.
For example,
data = [{"x": "hello\n"}, {"x": "yello\n"}, {"x": "jello\n"}]
df = pd.DataFrame(data)
# df:
# x
# 0 hello\n
# 1 yello\n
# 2 jello\n
df["x"].str.replace('\n', '')
# df["x"]:
# 0 hello
# 1 yello
# 2 jello
df["x"].replace('yello\n', 'bello\n')
# df["x"]:
# 0 hello\n
# 1 bello\n
# 2 jello\n
Related
I want to remove the all string after last underscore from the dataframe. If I my data in dataframe looks like.
AA_XX,
AAA_BB_XX,
AA_BB_XYX,
AA_A_B_YXX
I would like to get this result
AA,
AAA_BB,
AA_BB,
AA_A_B
You can do this simply using Series.str.split and Series.str.join:
In [2381]: df
Out[2381]:
col1
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
In [2386]: df['col1'] = df['col1'].str.split('_').str[:-1].str.join('_')
In [2387]: df
Out[2387]:
col1
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
Explaination:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})
Creates
col
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
Use apply in order to loop through the column you want to edit.
I broke the string at _ and then joined all parts leaving the last part at _
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
print(df)
Results:
col
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
If your dataset contains values like AA (values without underscore).
Change the lambda like this
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]) if len(r.split('_')) > 1 else r)
print(df)
Here is another way of going about it.
import pandas as pd
data = {'s': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']}
df = pd.DataFrame(data)
def cond1(s):
temp_s = s.split('_')
temp_len = len(temp_s)
if len(temp_s) == 1:
return temp_s
else:
return temp_s[:len(temp_s)-1]
df['result'] = df['s'].apply(cond1)
I have a dataframe with column a. I need to get data after second _.
a
0 abc_def12_0520_123
1 def_ghij123_0120_456
raw_data = {'a': ['abc_def12_0520_123', 'def_ghij123_0120_456']}
df = pd.DataFrame(raw_data, columns = ['a'])
Output:
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
What I have tried:
df['b'] = df.number.str.replace('\D+', '')
I tried removing alphabets first, But its getting complex. Any suggestions
Here is how:
df['b'] = ['_'.join(s.split('_')[2:]) for s in df['a']]
print(df)
Output:
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
Explanation:
lst = ['_'.join(s.split('_')[2:]) for s in df['a']]
is the equivalent of:
lst = []
for s in df['a']:
a = s.split('_')[2:] # List all strings in list of substrings splitted '_' besides the first 2
lst.append('_'.join(a))
Try:
df['b'] = df['a'].str.split('_',2).str[-1]
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
I have a column containing values. I want to split it based on a regex. If the regex matches, the original value will be replaced with the left-side of the split. A new column will contain the right-side of a split.
Below is some sample code. I feel I am close but it isn't quite working.
import pandas as pd
import re
df = pd.DataFrame({ 'A' : ["test123","foo"]})
// Regex example to split it if it ends in numbers
r = r"^(.+?)(\d*)$"
df['A'], df['B'] = zip(*df['A'].apply(lambda x: x.split(r, 1)))
print(df)
In the example above I would expect the following output
A B
0 test 123
1 foo
I am fairly new to Python and assumed this would be the way to go. However, it appears that I haven't quite hit the mark. Is anyone able to help me correct this example?
Just base on your own regex
df.A.str.split(r,expand=True).replace('',np.nan).dropna(thresh=1,axis=1).fillna('')
Out[158]:
1 2
0 test 123
1 foo
df[['A','B']]=df.A.str.split(r,expand=True).replace('',np.nan).dropna(thresh=1,axis=1).fillna('')
df
Out[160]:
A B
0 test 123
1 foo
Your regex is working just fine, use it with str.extract
df = pd.DataFrame({ 'A' : ["test123","foo", "12test3"]})
df[['A', 'B']] = df['A'].str.extract("^(.+?)(\d*)$", expand = True)
A B
0 test 123
1 foo
2 12test 3
def bar(x):
els = re.findall(r'^(.+?)(\d*)$', x)[0]
if len(els):
return els
else:
return x, None
def foo():
df = pd.DataFrame({'A': ["test123", "foo"]})
df['A'], df['B'] = zip(*df['A'].apply(bar))
print(df)
result:
A B
0 test 123
1 foo
I have a dataframe with column 'A' as a string. Within 'A' there are values like
name1-L89783
nametwo-L33009
I would like to make a new column 'B' such that the '-Lxxxx' is removed and all that remains is 'name1' and 'nametwo'.
use vectorised str.split for this and then use str again to access the array element of interest in this case the first element of the split:
In [10]:
df[1] = df[0].str.split('-').str[0]
df
Out[10]:
0 1
0 name1-L89783 name1
1 nametwo-L33009 nametwo
Initialize DataFrame.
df = pd.DataFrame(['name1-L89783','nametwo-L33009'],columns=['A',])
>>> df
A
0 name1-L89783
1 nametwo-L33009
Apply function over rows and put the result in a new column.
df['B'] = df['A'].apply(lambda x: x.split('-')[0])
>>> df
A B
0 name1-L89783 name1
1 nametwo-L33009 nametwo
I'm trying to replace and add some values in pandas dataframe object. I have to following code
import pandas as pd
df = pd.DataFrame.from_items([('A', ["va-lue", "value-%", "value"]), ('B', [4, 5, 6])])
print df
df['A'] = df['A'].str.replace('%', '_0')
print df
df['A'] = df['A'].str.replace('-', '')
print df
#allmost there?
df.A[df['A'].str.contains('-')] + "_0"
How can I find the cell values in column A which contains '-' sign, replace this value with '' and add for these values a trailing '_0'? The resulting data set should look like this
A B
0 value_0 4
1 value_0 5
2 value 6
You can first keep track of the rows whose A needs to be appended with the trailing string, and perform these operations in two steps:
mask = df['A'].str.contains('-')
df['A'] = df['A'].str.replace('-|%', '')
df.ix[mask, 'A'] += '_0'
print df
Output:
A B
0 value_0 4
1 value_0 5
2 value 6