Python Pandas: DataFrame is not updating using string methods

I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains a column named 'about' which holds the rows of data I want to manipulate.
I've already applied .str methods to update the values, but the changes are not reflected in the exported .csv file. Some of my code is shown below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')

You need to assign the output back to the column. It is also possible to chain both operations together, since they work on the same About column; and because the values are converted to lowercase first, the regex can be simplified to match only non-lowercase characters. Note that since pandas 2.0 str.replace defaults to regex=False, so pass regex=True explicitly:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
print (df)
    About
0    aasd
1  sdd aa

import pandas as pd
import numpy as np
columns = ['About']
data = ["ALPHA","OMEGA","ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '', regex=True)
print(df)
OUTPUT:
      About
0     alpha
1     omega
2  alphomga

Example Dataframe:
>>> df
        About
0      JOHN23
1     PINKO22
2   MERRY jen
3  Soojan San
4      Remo55
Solution: another way, using a compiled regex with flags.
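The compiled pattern itself is not shown in the original answer; based on the explanation further below, it is presumably something like:
import re

# assumed pattern: strip a trailing run of non-letters (regex_pat is not defined in the original answer)
regex_pat = re.compile(r'[^a-z]+$', flags=re.IGNORECASE)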
>>> df.About.str.lower().str.replace(regex_pat, '', regex=True)
0          john
1         pinko
2     merry jen
3    soojan san
4          remo
Name: About, dtype: object
Explanation:
[^a-z]+ matches one or more characters not in the range a-z (case sensitive); the + quantifier is greedy, matching as many characters as possible.
$ asserts position at the end of the line, so only a trailing run of non-letters is removed.
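A quick sanity check of the assumed pattern on plain strings:
import re

regex_pat = re.compile(r'[^a-z]+$', flags=re.IGNORECASE)
print(regex_pat.sub('', 'john23'))     # -> john
print(regex_pat.sub('', 'merry jen'))  # -> merry jen (ends in letters, nothing to strip)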

Related

Removing comma from values in column (csv file) using Python Pandas

I want to remove commas from a column named size.
CSV looks like below:
number  name   size
1       Car    9,32,123
2       Bike   1,00,000
3       Truck  10,32,111
I want the output as below:
number  name   size
1       Car    932123
2       Bike   100000
3       Truck  1032111
I am using Python 3 and the Pandas module for handling this csv.
I am trying the replace method but I don't get the desired output.
Snapshot from my code:
import pandas as pd
df = pd.read_csv("file.csv")
# df.replace(",", "")
# df['size'] = df['size'].replace(to_replace=",", value="")
# df['size'] = df['size'].replace(",", "")
df['size'] = df['size'].replace({",", ""})
print(df['size'])  # expecting to see 'size' column without commas
I don't see any error/exception. The last line, print(df['size']), simply displays the values as they are, i.e. with commas.
With replace, we need regex=True because otherwise it looks for an exact match of the whole cell value, i.e. it would only replace cells that contain nothing but ,:
>>> df["size"] = df["size"].replace(",", "", regex=True)
>>> df
   number   name     size
0       1    Car   932123
1       2   Bike   100000
2       3  Truck  1032111
Note that the pandas.read_csv function has an optional argument thousands; if , is used for denoting thousands, you might set thousands=",". Consider the following example:
import io
import pandas as pd
some_csv = io.StringIO('value\n"1"\n"1,000"\n"1,000,000"\n')
df = pd.read_csv(some_csv, thousands=",")
print(df)
output
     value
0        1
1     1000
2  1000000
For brevity I used io.StringIO; the same effect might be achieved by providing the name of a file with the same content as the first argument to pd.read_csv.
Try with str.replace instead:
df['size'] = df['size'].str.replace(',', '')
Optionally convert to int with astype:
df['size'] = df['size'].str.replace(',', '').astype(int)

   number   name     size
0       1    Car   932123
1       2   Bike   100000
2       3  Truck  1032111
Sample Frame Used:
df = pd.DataFrame({'number': [1, 2, 3], 'name': ['Car', 'Bike', 'Truck'],
                   'size': ['9,32,123', '1,00,000', '10,32,111']})

   number   name       size
0       1    Car   9,32,123
1       2   Bike   1,00,000
2       3  Truck  10,32,111
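Not part of the original answers, but if some cells might not parse cleanly as integers, pd.to_numeric with errors='coerce' is a more defensive cast; a minimal sketch with assumed data:
import pandas as pd

df = pd.DataFrame({'size': ['9,32,123', '1,00,000', 'n/a']})
# errors='coerce' turns unparseable values into NaN instead of raising
df['size'] = pd.to_numeric(df['size'].str.replace(',', '', regex=False), errors='coerce')
print(df)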

how to remove whitespace from string in pandas column

I need to remove whitespaces in pandas df column. My data looks like this:
industry  magazine
Home      "Goodhousekeeping.com"; "Prevention.com";
Fashion   "Cosmopolitan"; " Elle"; "Vogue"
Fashion   " Vogue"; "Elle"
Below is my code:
# split magazine column values, create a new column in df
df['magazine_list'] = df['magazine'].str.split(';')
# strip the leading whitespace from the strings
df.magazine_list = df.magazine_list.str.lstrip()
This returns all NaN; I have also tried:
df.magazine = df.magazine.str.lstrip()
This didn't remove the white spaces either.
Use a list comprehension with strip applied to the split values; also strip each value before splitting, to remove trailing ;, spaces and " characters. (Your str.lstrip returned NaN because after split the column holds lists, and the .str string methods only work on strings.)
f = lambda x: [y.strip('" ') for y in x.strip(';" ').split(';')]
df['magazine_list'] = df['magazine'].apply(f)
print (df)
  industry                                    magazine  \
0     Home  "Goodhousekeeping.com"; "Prevention.com";
1  Fashion            "Cosmopolitan"; " Elle"; "Vogue"
2  Fashion                            " Vogue"; "Elle"

                             magazine_list
0  [Goodhousekeeping.com, Prevention.com]
1              [Cosmopolitan, Elle, Vogue]
2                            [Vogue, Elle]
Jezrael provides a good solution. It is useful to know that pandas also has string accessors for similar operations, without the need for a list comprehension. Normally a list comprehension is faster, but depending on the use case the built-in string methods can be more readable or simpler to code.
df['magazine'] = (
    df['magazine']
    .str.replace(' ', '', regex=False)
    .str.replace('"', '', regex=False)
    .str.strip(';')
    .str.split(';')
)
Output

  industry                                 magazine
0     Home  [Goodhousekeeping.com, Prevention.com]
1  Fashion              [Cosmopolitan, Elle, Vogue]
2  Fashion                            [Vogue, Elle]

python - Replace first five characters in a column with asterisks

I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and replace the first five characters so it looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the values and replace just the first five numbers (only) with asterisks, keeping the hyphens as they are?
Similar to Mr. Me's answer, this will instead replace everything before the seventh character with your new masked format:
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can simply achieve this with the replace() method:
Example dataframe (borrowed from @AkshayNevrekar):
>>> df
           ssn
0  111-22-3333
1  121-22-1123
2  345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
           ssn
0  ***-**-3333
1  ***-**-1123
2  ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0    ***-**-3333
1    ***-**-1123
2    ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
Put your asterisks in front, then grab the last 4 digits. Note the .str accessor, so the slice applies to each string rather than to the rows of the Series:
new_ssn = '***-**-' + emp_pd["SSN"].str[-4:]
You can use regex:
import re
import pandas as pd

df = pd.DataFrame({'ssn': ['111-22-3333', '121-22-1123', '345-87-3425']})

def func(x):
    return re.sub(r'\d{3}-\d{2}', '***-**', x)

df['ssn'] = df['ssn'].apply(func)
print(df)
Output:
           ssn
0  ***-**-3333
1  ***-**-1123
2  ***-**-3425

replacing double pipes || in Pandas or Python

Working through some badly thought out data which uses '||' as a delimiter within a single string. I had an Excel file that is over 60 sheets and 100k individual records which have these '||' separating interests, for example:
email          interests
info@test.com  Sports||IT||Business||Other
I've tried using the following code to replace the pipes but it doesn't seem to work... are the pipes considered a special character? A Google search yielded no Python-specific results for me.
import pandas as pd
df = pd.read_excel("test.xlsx")
df["interests"] = df["interests"].replace('||', ' , ')
using str.replace for some reason just adds in a load of commas between each individual character
any help would be greatly appreciated!
Series.replace() uses regex=False by default, which means it only replaces cells whose entire value matches the given string.
Demo:
In [25]: df = pd.DataFrame({'col': ['ab ab', 'ab']})

In [26]: df
Out[26]:
     col
0  ab ab
1     ab

In [27]: df['col'].replace('ab', 'XXX')
Out[27]:
0    ab ab    # <--- NOTE: not replaced, the whole cell is not 'ab'
1      XXX
Name: col, dtype: object

In [28]: df['col'].replace('ab', 'ZZZ', regex=True)
Out[28]:
0    ZZZ ZZZ
1        ZZZ
Name: col, dtype: object
So don't forget to use the regex=True parameter:
In [23]: df["interests"] = df["interests"].replace(r'\|\|', ' , ', regex=True)

In [24]: df
Out[24]:
           email                       interests
0  info@test.com  Sports , IT , Business , Other
or use Series.str.replace(). Note that since pandas 2.0 str.replace also defaults to regex=False, so pass regex=True explicitly:
df["interests"] = df["interests"].str.replace(r'\|\|', ' , ', regex=True)
PS: besides that, | is a special regex symbol which means OR, so we need to escape it with a backslash. That is also why your str.replace attempt inserted commas between every character: as a regex, || is an alternation of two empty patterns, which matches the empty string at every position.
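If you'd rather not escape by hand, re.escape can build the pattern for an arbitrary delimiter. A small sketch, not from the original answers:
import re
import pandas as pd

df = pd.DataFrame({'interests': ['Sports||IT||Business||Other']})
pattern = re.escape('||')  # escapes every regex metacharacter -> '\\|\\|'
df['interests'] = df['interests'].str.replace(pattern, ' , ', regex=True)
print(df)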

Replacing punctuation in a data frame based on punctuation list [duplicate]

This question already has answers here:
Fast punctuation removal with pandas
(4 answers)
Closed 4 years ago.
Using Canopy and Pandas, I have data frame a which is defined by:
a=pd.read_csv('text.txt')
df=pd.DataFrame(a)
df.columns=["test"]
text.txt is a single-column file containing strings made up of text, numbers and punctuation.
Assuming df looks like:
test
%hgh&12
abc123!!!
porkyfries
I want my results to be:
test
hgh12
abc123
porkyfries
Effort so far:
from string import punctuation  # import punctuation list from python itself
a = pd.read_csv('text.txt')
df = pd.DataFrame(a)
df.columns = ["test"]  # define the dataframe

for p in list(punctuation):
    df2 = df.test.str.replace(p, '')
    df2 = pd.DataFrame(df2)
    df2
The commands above basically just return the same data set, since each pass starts again from df, so only the last punctuation character ends up removed.
Appreciate any leads.
Edit: The reason why I am using Pandas is because the data is huge, spanning about 1M rows, and future usage of the code will be applied to lists that go up to 30M rows.
Long story short, I need to clean data in a very efficient manner for big data sets.
Using replace with the correct regex would be easier:
In [41]:
import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text': ['test', '%hgh&12', 'abc123!!!', 'porkyfries']})
df
Out[41]:
         text
0        test
1     %hgh&12
2   abc123!!!
3  porkyfries

[4 rows x 1 columns]
Use regex with the pattern [^\w\s], which means not alphanumeric/whitespace (\w covers letters, digits and underscore, so digits like 123 are kept):
In [49]:
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
df
Out[49]:
         text
0        test
1       hgh12
2      abc123
3  porkyfries

[4 rows x 1 columns]
For removing punctuation from a text column in your dataframe:
In:
import re
import string
rem = string.punctuation
pattern = r"[{}]".format(rem)
pattern
Out:
'[!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~]'
In:
df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df
Out:
          text
0  book...regh
1      book...
2         boo,
3        book.
4        ball,
5   ballnroll"
6       "rope"
7      rick %
In:
df['text'] = df['text'].str.replace(pattern, '', regex=True)
df
You can replace the matches with any character you like, e.g. replace(pattern, '$').
Out:
        text
0   bookregh
1       book
2        boo
3       book
4       ball
5  ballnroll
6       rope
7       rick
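One caveat (my note, not from the original answer): string.punctuation happens to contain [, \ and ] in an order that keeps the character class above valid; a safer way to build the same pattern is re.escape:
import re
import string

# every metacharacter is backslash-escaped, so the class is valid regardless of ordering
pattern = '[' + re.escape(string.punctuation) + ']'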
Translate is often considered the cleanest and fastest way to remove punctuation (source). The original snippet is Python 2; in Python 3, str.translate instead takes a mapping built with str.maketrans:
import string

# remove all punctuation except the double-quote character
text = text.translate(str.maketrans('', '', string.punctuation.replace('"', '')))
You may find that it works better to remove punctuation in 'a' before loading it into pandas.
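Applied to a pandas column, a minimal sketch (assumed data, following the example frames above):
import string
import pandas as pd

df = pd.DataFrame({'test': ['%hgh&12', 'abc123!!!', 'porkyfries']})
table = str.maketrans('', '', string.punctuation)
# translate works per string; no regex engine involved
df['test'] = df['test'].map(lambda s: s.translate(table))
print(df)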
