pandas dataframe replace multiple substring of column

I have the code below:
import pandas as pd
df = pd.DataFrame({'A': ['$5,756', '3434', '$45', '1,344']})
pattern = ','.join(['$', ','])
df['A'] = df['A'].str.replace('$|,', '', regex=True)
print(df['A'])
I am trying to remove every occurrence of '$' or ',' by replacing it with an empty string, but only the ',' is being replaced.
The output I am getting:
0 $5756
1 3434
2 $45
3 1344
it should be
0 5756
1 3434
2 45
3 1344
What am I doing wrong?
Any help is appreciated.
Thanks

Use:
import pandas as pd
df = pd.DataFrame({'A': ['$5,756', '3434', '$45', '1,344']})
df['A'] = df['A'].str.replace('[$,]', '', regex=True)
print(df)
Output:
      A
0  5756
1  3434
2    45
3  1344
The problem is that the character $ has a special meaning in regular expressions. From the documentation (emphasis mine):
$
Matches the end of the string or just before the newline at the end
of the string, and in MULTILINE mode also matches before a newline.
foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$
matches only ‘foo’. More interestingly, searching for foo.$ in
'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode;
searching for a single $ in 'foo\n' will find two (empty) matches: one
just before the newline, and one at the end of the string.
So you need to escape the character or put it inside a character class.
As an alternative, use:
df['A'] = df['A'].str.replace(r'\$|,', '', regex=True)  # note the escaping backslash and the raw string
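To see why the original pattern left the dollar signs in place, the sketch below compares the unescaped pattern with the two fixes from this answer. The last variant uses re.escape, which is not part of the answer and is added here only as one possible way to build the pattern from a list of literal characters; the name chars is illustrative.

```python
import re
import pandas as pd

s = pd.Series(['$5,756', '3434', '$45', '1,344'])

# Unescaped: '$' is an end-of-string anchor, so only the commas are removed.
unescaped = s.str.replace('$|,', '', regex=True)

# Fix 1: inside a character class, '$' is a literal character.
char_class = s.str.replace('[$,]', '', regex=True)

# Fix 2: escape '$' explicitly (the raw string keeps the backslash intact).
escaped = s.str.replace(r'\$|,', '', regex=True)

# Assumption (not in the answer): build the pattern from literals with re.escape.
chars = ['$', ',']
pattern = '|'.join(re.escape(c) for c in chars)
from_list = s.str.replace(pattern, '', regex=True)

print(unescaped.tolist())   # ['$5756', '3434', '$45', '1344']
print(char_class.tolist())  # ['5756', '3434', '45', '1344']
```

All three fixed variants give the same result here; the character class is the shortest to write, while re.escape helps when the characters to remove come from data rather than from the source code.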

If you only have integer-like numbers, an easy option is to remove everything except digits (\D matches any non-digit character). Then you don't have to deal with special regex characters like $:
df['A'] = df['A'].str.replace(r'\D', '', regex=True)
Output:
      A
0  5756
1  3434
2    45
3  1344
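The "integer-like" condition matters because \D also removes minus signs and decimal points. The sketch below uses made-up sample values (they are not from the question) to show the difference between removing all non-digits and removing only '$' and ','.

```python
import pandas as pd

# Hypothetical values with a sign and a decimal point.
s = pd.Series(['$5,756', '-1,344', '$45.50'])

# \D removes every non-digit, including '-' and '.'.
digits_only = s.str.replace(r'\D', '', regex=True)

# Removing only '$' and ',' keeps the sign and the decimal point,
# so the result can still be converted to numbers.
cleaned = s.str.replace('[$,]', '', regex=True)
numbers = pd.to_numeric(cleaned)

print(digits_only.tolist())  # ['5756', '1344', '4550']
print(numbers.tolist())      # [5756.0, -1344.0, 45.5]
```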

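This might also be useful: with regex=False the pattern is treated as a literal string, so '$' needs no escaping.
import pandas as pd
df = pd.DataFrame({'A': ['$5,756', '3434', '$45', '1,344']})
df['A'] = df['A'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
print(df['A'])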

Related

Python insert space between numbers and characters in a column

I have a dataframe with multiple columns, and in one column I want to separate the letters from the numbers with a space.
In this example I want to add the space in the third column.
Do you know how to do this?
import pandas as pd
data = {'first_column': ['first_value', 'second_value', 'third_value'],
        'second_column': ['first_value', 'second_value', 'third_value'],
        'third_column': ['AA6589', 'GG6589', 'BXV6589'],
        'fourth_column': ['first_value', 'second_value', 'third_value'],
        }
df = pd.DataFrame(data)
print(df)

Use str.replace with a short regex:
df['third_column'] = df['third_column'].str.replace(r'(\D+)(\d+)',
                                                    r'\1 \2', regex=True)
regex:
(\D+) # capture one or more non-digits
(\d+) # capture one or more digits
replace with \1 \2 (first captured group, then space, then second captured group).
Alternative with lookarounds:
df['third_column'] = df['third_column'].str.replace(r'(?<=\D)(?=\d)',
                                                    ' ', regex=True)
meaning: insert a space at any position in-between a non-digit and a digit.
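Both variants can be checked side by side on the sample values from the question. This is a minimal sketch; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['AA6589', 'GG6589', 'BXV6589'])

# Capture groups: rebuild each match as "<non-digits> <digits>".
with_groups = s.str.replace(r'(\D+)(\d+)', r'\1 \2', regex=True)

# Lookarounds: match the empty position between a non-digit and a digit
# and put a space there.
with_lookarounds = s.str.replace(r'(?<=\D)(?=\d)', ' ', regex=True)

print(with_groups.tolist())       # ['AA 6589', 'GG 6589', 'BXV 6589']
print(with_lookarounds.tolist())  # ['AA 6589', 'GG 6589', 'BXV 6589']
```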

Similarly, you could extract the non-digit and digit parts of 'third_column' and join them with a space in between (expand=False makes str.extract return a Series instead of a one-column DataFrame):
df.assign(
    third_column=df["third_column"].str.extract(r'(\D+)', expand=False)
    + " "
    + df["third_column"].str.extract(r'(\d+)', expand=False)
)
   first_column second_column third_column fourth_column
0   first_value   first_value      AA 6589   first_value
1  second_value  second_value      GG 6589  second_value
2   third_value   third_value     BXV 6589   third_value

how to remove whitespace from string in pandas column

I need to remove whitespace in a pandas DataFrame column. My data looks like this:
industry  magazine
Home      "Goodhousekeeping.com"; "Prevention.com";
Fashion   "Cosmopolitan"; " Elle"; "Vogue"
Fashion   " Vogue"; "Elle"
Below is my code:
# split the magazine column values and create a new column in df
df['magazine_list'] = df['magazine'].str.split(';')
# strip the leading whitespace from the strings
df.magazine_list = df.magazine_list.str.lstrip()
This returns all NaN. I have also tried:
df.magazine = df.magazine.str.lstrip()
This didn't remove the whitespace either.

Use a list comprehension that strips each of the split values. Also strip the whole value before splitting, to remove the trailing ;, spaces and " characters:
f = lambda x: [y.strip('" ') for y in x.strip(';" ').split(';')]
df['magazine_list'] = df['magazine'].apply(f)
print(df)
industry magazine \
0 Home Goodhousekeeping.com; "Prevention.com";
1 Fashion Cosmopolitan; " Elle"; "Vogue"
2 Fashion Vogue; "Elle
magazine_list
0 [Goodhousekeeping.com, Prevention.com]
1 [Cosmopolitan, Elle, Vogue]
2 [Vogue, Elle]
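A self-contained version of this approach is sketched below. The DataFrame is rebuilt from the sample rows in the question, and the helper name clean is illustrative. The first part shows why the original attempt returned NaN: after the split each cell holds a list, and .str methods return NaN for values that are not strings.

```python
import pandas as pd

df = pd.DataFrame({
    'industry': ['Home', 'Fashion', 'Fashion'],
    'magazine': ['"Goodhousekeeping.com"; "Prevention.com";',
                 '"Cosmopolitan"; " Elle"; "Vogue"',
                 '" Vogue"; "Elle"'],
})

# After the split each cell is a list, so .str.lstrip() yields NaN everywhere.
split_only = df['magazine'].str.split(';')
all_nan = split_only.str.lstrip().isna().all()

def clean(cell):
    # Trim outer ';', quotes and spaces first so no empty item is produced,
    # then strip quotes and spaces from every piece.
    return [part.strip('" ') for part in cell.strip(';" ').split(';')]

df['magazine_list'] = df['magazine'].apply(clean)

print(all_nan)
print(df['magazine_list'].tolist())
```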

Jezrael provides a good solution. It is useful to know that pandas has string accessors for similar operations without the need of list comprehensions. Normally a list comprehension is faster, but depending on the use case using pandas built-in functions could be more readable or simpler to code.
df['magazine'] = (
    df['magazine']
    .str.replace(' ', '', regex=False)
    .str.replace('"', '', regex=False)
    .str.strip(';')
    .str.split(';')
)
Output:
  industry                                magazine
0     Home  [Goodhousekeeping.com, Prevention.com]
1  Fashion             [Cosmopolitan, Elle, Vogue]
2  Fashion                           [Vogue, Elle]
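One thing to keep in mind with the chain above is that replacing ' ' removes every space, including spaces inside a name. The sketch below uses a made-up multi-word title (not from the question) to show this, together with one possible accessor-only variant that splits on ';' plus any surrounding whitespace instead.

```python
import pandas as pd

# 'Good Housekeeping' is a hypothetical value with an inner space.
s = pd.Series(['"Good Housekeeping"; "Prevention.com";',
               '"Cosmopolitan"; " Elle"; "Vogue"'])

# Removing every space also joins multi-word names.
all_spaces = (
    s.str.replace(' ', '', regex=False)
     .str.replace('"', '', regex=False)
     .str.strip(';')
     .str.split(';')
)

# Variant: drop the quotes, trim the ends, then split on ';' with optional
# whitespace around it. A pattern longer than one character is treated as
# a regular expression by str.split.
keep_inner = (
    s.str.replace('"', '', regex=False)
     .str.strip('; ')
     .str.split(r'\s*;\s*')
)

print(all_spaces.tolist())
print(keep_inner.tolist())
```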

Python Pandas: Dataframe is not updating using string methods

I'm trying to update the strings in a .csv file that I am reading with pandas. The .csv contains a column named 'About', which holds the rows of data I want to manipulate.
I've already used the .str methods to update it, but the changes are not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')

You need to assign the output back to the column. Both operations can also be chained, because they work on the same column About. And because the values are converted to lowercase first, the regex only needs to keep lowercase letters:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
print(df)
    About
0    aasd
1  sdd aa
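The underlying point is that the .str methods return a new Series and never modify the column in place. A minimal sketch with the sample data from the answer above:

```python
import pandas as pd

df = pd.DataFrame({'About': ['AaSD14%', 'SDD Aa']})

# The result of this call is discarded, so df is unchanged.
df['About'].str.lower()
unchanged = df['About'].tolist()

# Assigning the result back to the column is what updates the DataFrame.
df['About'] = df['About'].str.lower().str.replace('[^a-z ]', '', regex=True)
updated = df['About'].tolist()

print(unchanged)  # ['AaSD14%', 'SDD Aa']
print(updated)    # ['aasd', 'sdd aa']
```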

import pandas as pd
columns = ['About']
data = ["ALPHA", "OMEGA", "ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '', regex=True)
print(df)
Output:
      About
0     alpha
1     omega
2  alphomga

Example DataFrame:
>>> df
        About
0      JOHN23
1     PINKO22
2   MERRY jen
3  Soojan San
4      Remo55
Another way, using a compiled regex (which can also carry flags):
>>> import re
>>> regex_pat = re.compile(r'[^a-z]+$')  # assumed definition, inferred from the explanation below
>>> df.About.str.lower().str.replace(regex_pat, '', regex=True)
0          john
1         pinko
2     merry jen
3    soojan san
4          remo
Name: About, dtype: object
Explanation of the pattern [^a-z]+$:
[^a-z] matches a single character that is not in the range between a (index 97) and z (index 122), case sensitive
+ is a quantifier: it matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the string

replacing double pipes || in Pandas or Python

I'm working through some badly thought out data that uses '||' as a delimiter within a single string. I have an Excel file with over 60 sheets and 100k individual records, in which '||' separates the interests. For example:
email          interests
info#test.com  Sports||IT||Business||Other
I've tried the following code to replace the pipes, but it doesn't seem to work. Are the pipes considered a special character? A Google search yielded no Python-specific results for me.
import pandas as pd
df = pd.read_excel("test.xlsx")
df["interests"] = df["interests"].replace('||', ' , ')
Using str.replace for some reason just adds a load of commas between each individual character.
Any help would be greatly appreciated!

Series.replace() uses regex=False by default, which means it only replaces cells whose whole value equals the given string.
Demo:
In [25]: df = pd.DataFrame({'col':['ab ab', 'ab']})
In [26]: df
Out[26]:
col
0 ab ab
1 ab
In [27]: df['col'].replace('ab', 'XXX')
Out[27]:
0 ab ab # <--- NOTE!
1 XXX
Name: col, dtype: object
In [28]: df['col'].replace('ab', 'ZZZ', regex=True)
Out[28]:
0 ZZZ ZZZ
1 ZZZ
Name: col, dtype: object
So don't forget to use the regex=True parameter:
In [23]: df["interests"] = df["interests"].replace(r'\|\|', ' , ', regex=True)
In [24]: df
Out[24]:
           email                       interests
0  info#test.com  Sports , IT , Business , Other
or use Series.str.replace(), which works on substrings. Pass regex=True explicitly, because pandas 2.0 and later default to regex=False:
df["interests"] = df["interests"].str.replace(r'\|\|', ' , ', regex=True)
PS: | is a special regex symbol that means OR, so it needs to be escaped with a backslash.
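The "commas between each individual character" effect from the question can be reproduced with the re module directly: as a regex, '||' is an alternation of empty patterns, and an empty pattern matches at every position. The sketch below also shows two ways to treat the delimiter literally; using re.escape is a suggestion added here, not something from the answer.

```python
import re
import pandas as pd

# '||' as a regex means "empty OR empty OR empty", which matches everywhere.
everywhere = re.sub('||', ',', 'ab')
print(everywhere)  # ,a,b,

s = pd.Series(['Sports||IT||Business||Other'])

# Option 1: treat the delimiter as a literal string.
literal = s.str.replace('||', ' , ', regex=False)

# Option 2: keep regex mode but let re.escape add the backslashes.
escaped = s.str.replace(re.escape('||'), ' , ', regex=True)

print(literal.tolist())  # ['Sports , IT , Business , Other']
```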

pandas: Replace string is not replacing targeted substring

I am trying to iterate over a list of strings in dataframe1 and remove any of them that appear in the strings of dataframe2.
for index, row in nlp_df.iterrows():
    print(row['x1'])
    string1 = row['x1'].replace("(", r"\(")
    string1 = string1.replace(")", r"\)")
    string1 = string1.replace("[", r"\[")
    string1 = string1.replace("]", r"\]")
    nlp2_df['title'] = nlp2_df['title'].replace(string1, "")
To do this, I iterate with the code shown above to check for and replace any string found in df1.
The output below shows the strings in df1:
wait_timeout
interactive_timeout
pool_recycle
....
__all__
folder_name
re.compile('he(lo')
The output below shows df2 after replacing the strings:
0 have you tried watching the traffic between th...
1 /dev/cu.xxxxx is the "callout" device, it's wh...
2 You'll want the struct package.\r\r\n
In df2, strings like /dev/cu.xxxxx should have been removed during the iteration, but as shown they are still there. However, when I tried nlp2_df['title'] = nlp2_df['title'].replace("/dev/cu.xxxxx","") directly, it was removed successfully.
Is there a reason why writing the string directly works, but looping and using a variable for the replacement does not?

IIUC you can simply use regular expressions:
nlp2_df['title'] = nlp2_df['title'].str.replace(r'([\(\)\[\]])', r'\\\1', regex=True)
PS: you don't need a for loop at all.
Demo:
In [15]: df
Out[15]:
           title
0  aaa (bbb) ccc
1   A [word] ...
In [16]: df['new'] = df['title'].str.replace(r'([\(\)\[\]])', r'\\\1', regex=True)
In [17]: df
Out[17]:
           title              new
0  aaa (bbb) ccc  aaa \(bbb\) ccc
1   A [word] ...   A \[word\] ...
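If the goal is to remove every string from the first DataFrame wherever it occurs inside the titles of the second, one possible approach (a sketch, not the method from the answer above) is to escape each string with re.escape and join them into a single alternation. The sample terms and titles below are shortened, hypothetical stand-ins for nlp_df['x1'] and nlp2_df['title'].

```python
import re
import pandas as pd

# Hypothetical stand-ins for nlp_df['x1'] and nlp2_df['title'].
terms = ['/dev/cu.xxxxx', "re.compile('he(lo')", '__all__']
titles = pd.Series(['/dev/cu.xxxxx is the "callout" device',
                    "try re.compile('he(lo') instead"])

# re.escape neutralises regex metacharacters such as '(', '[' and '.'.
# Longest terms go first so a longer term wins over a shorter one it contains.
pattern = '|'.join(re.escape(t) for t in sorted(terms, key=len, reverse=True))

# Series.str.replace works on substrings; Series.replace without regex=True
# only replaces cells whose whole value equals the search string.
cleaned = titles.str.replace(pattern, '', regex=True)

print(cleaned.tolist())
```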
