working through some badly thought out data which uses '||' as a delimiter within a single string. I had an excel file that is over 60 sheets and 100k individual records which has these '||' separating interests. for example:
email interests
info#test.com Sports||IT||Business||Other
I've tried using the following code to replace the pipes but it doesn't seem to work.. are the pipes considered a special character? A google search yielded no Python specific results for me.
import pandas as pd
df = pd.read_excel("test.xlsx")
df["interests"] = df["interests"].replace('||', ' , ')
using str.replace for some reason just adds in a load of commas between each individual character
any help would be greatly appreciated!
Series.replace(..., regex=False, ...) uses regex=False per default, which means it'll try to replace the whole cell value.
Demo:
In [25]: df = pd.DataFrame({'col':['ab ab', 'ab']})
In [26]: df
Out[26]:
col
0 ab ab
1 ab
In [27]: df['col'].replace('ab', 'XXX')
Out[27]:
0 ab ab # <--- NOTE!
1 XXX
Name: col, dtype: object
In [28]: df['col'].replace('ab', 'ZZZ', regex=True)
Out[28]:
0 ZZZ ZZZ
1 ZZZ
Name: col, dtype: object
So don't forget to use regex=True parameter:
In [23]: df["interests"] = df["interests"].replace('\|\|', ' , ', regex=True)
In [24]: df
Out[24]:
email interests
0 info#test.com Sports , IT , Business , Other
or use Series.str.replace() which always treats it as RegEx:
df["interests"] = df["interests"].str.replace('\|\|', ' , ')
PS beside that | is a special RegEx symbol, which means OR, so we need to escape it with a back-slash character
Related
I have below the code
import pandas as pd
df = pd.DataFrame({'A': ['$5,756', '3434', '$45', '1,344']})
pattern = ','.join(['$', ','])
df['A'] = df['A'].str.replace('$|,', '', regex=True)
print(df['A'])
What I am trying to remove every occurrence of '$' or ','... so I am trying to replace with blank..
But its replacing only ,
Output I am getting
0 $5756
1 3434
2 $45
3 1344$
it should be
0 5756
1 3434
2 45
3 1344
What I am doing wrong
Any help appreciated
Thanks
Use:
import pandas as pd
df = pd.DataFrame({'A': ['$5,756', '3434', '$45', '1,344']})
df['A'] = df['A'].str.replace('[$,]', '', regex=True)
print(df)
Output
A
0 5756
1 3434
2 45
3 1344
The problem is that the character $ has a special meaning in regular expressions. From the documentation (emphasis mine):
$
Matches the end of the string or just before the newline at the end
of the string, and in MULTILINE mode also matches before a newline.
foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$
matches only ‘foo’. More interestingly, searching for foo.$ in
'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode;
searching for a single $ in 'foo\n' will find two (empty) matches: one
just before the newline, and one at the end of the string.mode;
searching for a single $ in 'foo\n' will find two (empty) matches: one
just before the newline, and one at the end of the string.
So you need to escape the character or put it inside a character class.
As an alternative use:
df['A'].str.replace('\$|,', '', regex=True) # note the escaping \
If you only have integer-like numbers an easy option is to remove all but digits \D, then you don't have to deal with other special regex characters like $:
df['A'] = df['A'].str.replace(r'\D', '', regex=True)
output:
A
0 5756
1 3434
2 45
3 1344
It might be useful for you:
import pandas as pd
df = pd.DataFrame({'A': ['$5,756', '3434', '$45', '1,344']})
df['A'] = df['A'].str.replace('$', '', regex=True)
print(df['A'])
I need to remove whitespaces in pandas df column. My data looks like this:
industry magazine
Home "Goodhousekeeping.com"; "Prevention.com";
Fashion "Cosmopolitan"; " Elle"; "Vogue"
Fashion " Vogue"; "Elle"
Below is my code:
# split magazine column values, create a new column in df
df['magazine_list'] = dfl['magazine'].str.split(';')
# stip the first whitespace from strings
df.magazine_list = df.magazine_list.str.lstrip()
This returns all NaN, I have also tried:
df.magazine = df.magazine.str.lstrip()
This didn't remove the white spaces either.
Use list comprehension with strip of splitted values, also strip values before split for remove trailing ;, spaces and " values:
f = lambda x: [y.strip('" ') for y in x.strip(';" ').split(';')]
df['magazine_list'] = df['magazine'].apply(f)
print (df)
industry magazine \
0 Home Goodhousekeeping.com; "Prevention.com";
1 Fashion Cosmopolitan; " Elle"; "Vogue"
2 Fashion Vogue; "Elle
magazine_list
0 [Goodhousekeeping.com, Prevention.com]
1 [Cosmopolitan, Elle, Vogue]
2 [Vogue, Elle]
Jezrael provides a good solution. It is useful to know that pandas has string accessors for similar operations without the need of list comprehensions. Normally a list comprehension is faster, but depending on the use case using pandas built-in functions could be more readable or simpler to code.
df['magazine'] = (
df['magazine']
.str.replace(' ', '', regex=False)
.str.replace('"', '', regex=False)
.str.strip(';')
.str.split(';')
)
Output
industry magazine
0 Home [Goodhousekeeping.com, Prevention.com]
1 Fashion [Cosmopolitan, Elle, Vogue]
2 Fashion [Vogue, Elle]
I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and replace the first five characters so it looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the value and replace just the first five numbers (only) with asterisks and keep the hiphens as is?
Similar to Mr. Me, this will instead remove everything before the first 6 characters and replace them with your new format.
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can simply achieve this with replace() method:
Example dataframe :
borrows from #AkshayNevrekar..
>>> df
ssn
0 111-22-3333
1 121-22-1123
2 345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
Put your asterisks in front, then grab the last 4 digits.
new_ssn = '***-**-' + emp_pd["SSN"][-4:]
You can use regex
df = pd.DataFrame({'ssn':['111-22-3333','121-22-1123','345-87-3425']})
def func(x):
return re.sub(r'\d{3}-\d{2}','***-**', x)
df['ssn'] = df['ssn'].apply(func)
print(df)
Output:
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains the column name 'about' which contains the rows of data I want to manipulate.
I've already used str. to update but it is not reflecting in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need assign output to column, also is possible chain both operation together, because working with same column About and because values are converted to lowercase, is possible change regex to replace not uppercase:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '')
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '')
print (df)
About
0 aasd
1 sdd aa
import pandas as pd
import numpy as np
columns = ['About']
data = ["ALPHA","OMEGA","ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '')
print(df)
OUTPUT:
Example Dataframe:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
Solution:,another way Using a compiled regex with flags
>>> df.About.str.lower().str.replace(regex_pat, '')
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation:
Match a single character not present in the list below [^a-z]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy) a-z a single character in
the range between a (index 97) and z (index 122) (case sensitive)
$ asserts position at the end of a line
I am trying to iterate a list of strings using dataframe1 to check whether the other dataframe2 has any strings found in dataframe1 to replace them.
for index, row in nlp_df.iterrows():
print( row['x1'] )
string1 = row['x1'].replace("(","\(")
string1 = string1.replace(")","\)")
string1 = string1.replace("[","\[")
string1 = string1.replace("]","\]")
nlp2_df['title'] = nlp2_df['title'].replace(string1,"")
In order to do this I iterated using the code shown above to check and replace for any string found in df1
The output belows shows the strings in df1
wait_timeout
interactive_timeout
pool_recycle
....
__all__
folder_name
re.compile('he(lo')
The output below shows the output after replacing strings in df2
0 have you tried watching the traffic between th...
1 /dev/cu.xxxxx is the "callout" device, it's wh...
2 You'll want the struct package.\r\r\n
For the output in df2 strings like /dev/cu.xxxxx should have been replaced during the iteration but as shown it is not removed. However, I have attempted using nlp2_df['title'] = nlp2_df['title'].replace("/dev/cu.xxxxx","") and managed to remove it successfully.
Is there a reason why directly writing the string works but looping using a variable to use for replacing does not?
IIUC you can simply use regular expressions:
nlp2_df['title'] = nlp2_df['title'].str.replace(r'([\(\)\[\]])',r'\\\1')
PS you don't need for loop at all...
Demo:
In [15]: df
Out[15]:
title
0 aaa (bbb) ccc
1 A [word] ...
In [16]: df['new'] = df['title'].str.replace(r'([\(\)\[\]])',r'\\\1')
In [17]: df
Out[17]:
title new
0 aaa (bbb) ccc aaa \(bbb\) ccc
1 A [word] ... A \[word\] ...