python - Replace first five characters in a column with asterisks


I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and replace the first five characters so it looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the values and replace just the first five digits with asterisks while keeping the hyphens as they are?

Similar to Mr. Me's answer, this instead drops the first 6 characters (the digits and the hyphen between them) and puts the masked prefix in their place.
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])

You can achieve this with the replace() method.
Example dataframe (borrowed from @AkshayNevrekar):
>>> df
ssn
0 111-22-3333
1 121-22-1123
2 345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
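If the goal is to mask the digits only and leave whatever separators are there untouched, the same regex can be paired with a callable replacement. This is a minimal sketch, not part of the original answer: the helper name mask_digits and the sample frame are assumptions made for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data; the column name "ssn" follows the answer above.
df = pd.DataFrame({"ssn": ["111-22-3333", "121-22-1123", "345-87-3425"]})

def mask_digits(match: re.Match) -> str:
    """Replace each digit in the matched prefix with '*', leaving hyphens alone."""
    return re.sub(r"\d", "*", match.group(0))

# Only the leading "ddd-dd" prefix is matched, so the last four digits survive.
masked = df["ssn"].str.replace(r"^\d{3}-\d{2}", mask_digits, regex=True)
print(masked.tolist())
```

Because the hyphen is copied from the match rather than typed into the replacement string, the mask stays in step with the pattern if the pattern is later changed.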

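Put your asterisks in front, then grab the last 4 digits. Use the .str accessor so the slice applies to each string rather than to the rows of the Series.
new_ssn = '***-**-' + emp_pd["SSN"].str[-4:]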

You can use regex:
import re
import pandas as pd
df = pd.DataFrame({'ssn':['111-22-3333','121-22-1123','345-87-3425']})
def func(x):
    # mask the first "ddd-dd" group and keep the rest of the string
    return re.sub(r'\d{3}-\d{2}','***-**', x)
df['ssn'] = df['ssn'].apply(func)
print(df)
Output:
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425

Related

Multi-part manipulation post str.split() Pandas

I have a subset of data (single column) we'll call ID:
ID
0 07-1401469
1 07-89556629
2 07-12187595
3 07-381962
4 07-99999085
The current format is (usually) YY-[up to 8-character ID].
The desired output format is a more uniform YYYY-xxxxxxxx:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
Knowing that I've done padding in the past, the thought process was to combine
df['id'].str.split('-').str[0].apply(lambda x: '{0:20>4}'.format(x))
df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x))
However I ran into a few problems:
The fill in '{0:20>4}' must be a single character, not a multi-character string such as '20'.
Trying something like the line below just results in df['id'] taking the result of the last lambda, and every other way I tried to combine multiple apply/lambda calls didn't work. I started going down the pad left/right route, but that seemed to be taking me backwards.
df['id'] = (df['id'].str.split('-').str[0].apply(lambda x: '{0:X>4}'.format(x)).str[1].apply(lambda x: '{0:0>8}'.format(x)))
The current solution I have (but HATE because it's long, messy, and just not clean IMO) is:
df['idyear'] = df['id'].str.split('-').str[0].apply(lambda x: '{:X>4}'.format(x)) # Split on '-' and pad with X
df['idyear'] = df['idyear'].str.replace('XX', '20') # Replace XX with 20 to conform to YYYY
df['idnum'] = df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x)) # Pad 0s up to 8 digits
df['id'] = df['idyear'].map(str) + "-" + df['idnum'] # Merge idyear and idnum to remake id
del df['idnum'] # delete extra
del df['idyear'] # delete extra
Which does work
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
But my questions are
Is there a way to run multiple apply() functions in a single line so I'm not making temp variables?
Is there a better way than replacing 'XX' with '20'?
I feel like this entire code block can be compressed to 1 or 2 lines; I just don't know how. Everything I've seen on SO and in the pandas documentation so far only highlights or relates to a single manipulation.

One option is to split; then use str.zfill to pad '0's. Also prepend '20's before splitting, since you seem to need it anyway:
tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-'+ tmp.str[1].str.zfill(8)
Output:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
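The split-and-pad idea can also be written as one small function passed to a single apply() call, which avoids temporary columns. This is only a sketch: normalize_id is a hypothetical helper name, and it assumes every ID has exactly the YY-number shape and that all years fall in the 2000s, as the answer above does.

```python
import pandas as pd

# Sample values taken from the question.
df = pd.DataFrame({"ID": ["07-1401469", "07-89556629", "07-381962"]})

def normalize_id(raw: str) -> str:
    """Turn 'YY-n' into '20YY-nnnnnnnn' (assumes years in the 2000s)."""
    year, number = raw.split("-", 1)
    # zfill left-pads the numeric part with zeros up to 8 characters
    return f"20{year}-{number.zfill(8)}"

df["ID"] = df["ID"].apply(normalize_id)
print(df)
```

Keeping both steps in one named function also makes it easy to unit-test the formatting without a DataFrame.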

I'd do it in two steps, using .str.replace:
df["ID"] = df["ID"].str.replace(r"^(\d{2})-", r"20\1-", regex=True)
df["ID"] = df["ID"].str.replace(r"-(\d+)", lambda g: f"-{g[1]:0>8}", regex=True)
print(df)
Prints:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085

How to create a new string by combining values corresponding to keys in several dictionaries in Python?

I have two dictionaries:
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
and a table consisting of one single column where bond names are contained:
bond_names=pd.DataFrame({'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']})
I need to replace the name with a string of the following format: EUA21 where the first two letters are the corresponding value to the currency key in the dictionary, the next letter is the value corresponding to the month key and the last two digits are the year from the name.
I tried to split the name using the following code:
bond_names['Names']=bond_names['Names'].apply(lambda x: x.split('.'))
but I am not sure how to proceed from here. To build the string I need to look up the currency and the month in their dictionaries at the same time, join the two values, and append the year from the name.

This will give you a list of what you need:
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names = {'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']}
result = []
for names in bond_names['Names']:
    bond = names.split('.')
    result.append(currency[bond[1]] + time[bond[2]] + bond[3])
print(result)
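When the names live in a DataFrame column, the same lookup can be done column-wise by splitting once and mapping each part through its dictionary. This is a sketch under the assumption that every name has exactly four dot-separated parts; the variable names parts and codes are illustrative and not from the original answers.

```python
import pandas as pd

time = {'JAN': 'A', 'FEB': 'B', 'MAR': 'C', 'APR': 'D', 'MAY': 'E', 'JUN': 'F', 'JUL': 'H'}
currency = {'USD': 'US', 'EUR': 'EU', 'GBP': 'GB', 'HUF': 'HF'}
bond_names = pd.DataFrame({'Names': ['Bond.USD.JAN.21', 'Bond.USD.MAR.25', 'Bond.EUR.APR.22',
                                     'Bond.HUF.JUN.21', 'Bond.HUF.JUL.23', 'Bond.GBP.JAN.21']})

# expand=True returns one column per dot-separated part: 0=prefix, 1=currency, 2=month, 3=year
parts = bond_names['Names'].str.split('.', expand=True)

# map() looks each key up in its dictionary; unknown keys would become NaN
codes = parts[1].map(currency) + parts[2].map(time) + parts[3]
print(codes.tolist())
```

Unlike direct dictionary indexing, map() does not raise on an unknown currency or month, so missing codes show up as NaN in the result and can be checked afterwards.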

You can do that like this:
import pandas as pd
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency = {'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names = pd.DataFrame({'Names': ['Bond.USD.JAN.21', 'Bond.USD.MAR.25', 'Bond.EUR.APR.22', 'Bond.HUF.JUN.21', 'Bond.HUF.JUL.23', 'Bond.GBP.JAN.21']})
bond_names['Names2'] = bond_names['Names'].apply(lambda x: currency[x[5:8]] + time[x[9:12]] + x[-2:])
print(bond_names['Names2'])
# 0 USA21
# 1 USC25
# 2 EUD22
# 3 HFF21
# 4 HFH23
# 5 GBA21
# Name: Names2, dtype: object

With extended regex substitution:
In [42]: bond_names['Names'].str.replace(r'^[^.]+\.([^.]+)\.([^.]+)\.(\d+)',
    ...:     lambda m: '{}{}{}'.format(currency.get(m.group(1), m.group(1)), time.get(m.group(2), m.group(2)), m.group(3)),
    ...:     regex=True)
Out[42]:
0 USA21
1 USC25
2 EUD22
3 HFF21
4 HFH23
5 GBA21
Name: Names, dtype: object

You can try this :
import pandas as pd
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names=pd.DataFrame({'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']})
bond_names['Names']=bond_names['Names'].apply(lambda x: x.split('.'))
for idx, bond in enumerate(bond_names['Names']):
    currencyID = currency.get(bond[1])
    monthID = time.get(bond[2])
    yearID = bond[3]
    # .loc avoids chained assignment, which may not write back to the frame
    bond_names.loc[idx, 'Names'] = currencyID + monthID + yearID
Output
Names
0 USA21
1 USC25
2 EUD22
3 HFF21
4 HFH23
5 GBA21

How do I separate two numbers in pandas

How do I separate two numbers if the combined length is 18? Specifically, I want to split a mobile number (10 digits) from a landline number (8 digits) when the two are joined into a single 18-digit string.
I have tried extracting the first 8 digits, but I don't know how to apply the condition. I also need to remove those first 8 digits when the condition is satisfied.
df['Landline'] = df['Number'].str[:8]
I have tried this, but I know it's wrong:
df['Landline'] = df['Number'].apply(lambda x : x.str[:8] if len(x)==18 )

To extract the first 8 digits, use findall (raw strings keep \d from being treated as a string escape).
df['Number'].str.findall(r'^\d{8}')
Solution using an example
Here we use the Dummy Data made in the following section.
# separate landline and mobile numbers
phone_numbers = df.Numbers.str.findall(r'(^\d{8})*(\d{10})').tolist()
# store in a dict
d = dict((i, {'Landline': e[0][0], 'Mobile': e[0][1]}) for i, e in enumerate(phone_numbers))
# make a temporary dataframe
phone_df = pd.DataFrame(d).T
# update original dataframe
df['Landline'] = phone_df['Landline']
df['Mobile'] = phone_df['Mobile']
print(df)
Output:
Numbers Landline Mobile
0 123456780123456789 12345678 0123456789
1 0123456789 0123456789
Dummy Data
df = pd.DataFrame({'Numbers': ['123456780123456789', '0123456789', ]})
print(df)
Output:
Numbers
0 123456780123456789
1 0123456789

Looks like you need
df['Landline'] = df['Number'].apply(lambda x : x[:8] if len(x)==18 else x)
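Another way to express the condition is to make the 8-digit landline an optional group and let str.extract build both columns at once. The sketch below is an assumption-laden illustration rather than one of the original answers: it uses the 'Numbers' column from the dummy data above and assumes a value is either a bare 10-digit mobile number or an 8-digit landline followed by a 10-digit mobile number.

```python
import pandas as pd

# Dummy data in the same shape as the example above.
df = pd.DataFrame({'Numbers': ['123456780123456789', '0123456789']})

# Optional 8-digit landline, then a mandatory 10-digit mobile number.
# Named groups become the column names of the extracted frame.
pattern = r'^(?P<Landline>\d{8})?(?P<Mobile>\d{10})$'
df = df.join(df['Numbers'].str.extract(pattern))
print(df)
```

Rows without a landline get NaN in the Landline column, which can be replaced with an empty string via fillna('') if that reads better downstream.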

Python: Add 0/Zero in a string inside a cell

I have this sample data in a cell:
EmployeeID
2016-CT-1028
2016-CT-1028
2017-CT-1063
2017-CT-1063
2015-CT-948
2015-CT-948
My problem is how to add a 0 inside a value such as 2015-CT-948 to make it 2015-CT-0948.
I tried this code:
pattern = re.compile(r'(\d\d+)-(\w\w)-(\d\d\d)')
newlist = list(filter(pattern.match, idList))
The idea was to get the values matching the regex pattern and then add the 0 with zfill(), but it's not working. Can someone give me an idea of how to do it? Is there any way I can do it with regex or in pandas? Thank you!

This is one approach using zfill
Ex:
import pandas as pd
def custZfill(val):
    val = val.split("-")
    # alternative: split by the last "-" only
    # val = val.rsplit("-", 1)
    val[-1] = val[-1].zfill(4)
    return "-".join(val)
df = pd.DataFrame({"EmployeeID": ["2016-CT-1028", "2016-CT-1028",
"2017-CT-1063", "2017-CT-1063",
"2015-CT-948", "2015-CT-948"]})
print(df["EmployeeID"].apply(custZfill))
Output:
0 2016-CT-1028
1 2016-CT-1028
2 2017-CT-1063
3 2017-CT-1063
4 2015-CT-0948
5 2015-CT-0948
Name: EmployeeID, dtype: object

With pandas it can be solved with split instead of regex:
df['EmployeeID'].apply(lambda x: '-'.join(x.split('-')[:-1] + [x.split('-')[-1].zfill(4)]))

In pandas, you could use str.replace
df['EmployeeID'] = df.EmployeeID.str.replace(r'-(\d{3})$', r'-0\1', regex=True)
# Output:
0 2016-CT-1028
1 2016-CT-1028
2 2017-CT-1063
3 2017-CT-1063
4 2015-CT-0948
5 2015-CT-0948
Name: EmployeeID, dtype: object
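The pattern above handles exactly three trailing digits. If shorter numbers can occur, one possible generalisation is a callable replacement that pads whatever trailing digits it finds. This is a sketch, and the value '2015-CT-7' is a made-up example added to show the difference.

```python
import pandas as pd

# The last value is hypothetical, added to show padding of a 1-digit suffix.
df = pd.DataFrame({"EmployeeID": ["2016-CT-1028", "2017-CT-1063", "2015-CT-948", "2015-CT-7"]})

# Match the run of digits at the end of the string and left-pad it to 4 characters.
padded = df["EmployeeID"].str.replace(
    r"(\d+)$", lambda m: m.group(1).zfill(4), regex=True
)
print(padded.tolist())
```

Values that already have four or more trailing digits pass through unchanged, since zfill never truncates.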

If the format of the IDs is strictly defined, you can also use a simple list comprehension to do this job:
ids = [
'2017-CT-1063',
'2015-CT-948',
'2015-CT-948'
]
new_ids = [id if len(id) == 12 else id[0:8]+'0'+id[8:] for id in ids]
print(new_ids)
# ['2017-CT-1063', '2015-CT-0948', '2015-CT-0948']

Here's a one liner:
df['EmployeeID'].apply(lambda x: '-'.join(xi if i != 2 else '%04d' % int(xi) for i, xi in enumerate(x.split('-'))))

How to extract content from the regex output which has square bracket in python

I have a pandas DataFrame (Python 2.7) with a column that looks something like this:
email
['jsaw#yahoo.com']
['jfsjhj#yahoo.com']
['jwrk#yahoo.com']
['rankw#yahoo.com']
I want to extract the email from it without the square brackets and single quotes. The output should look like this:
email
jsaw#yahoo.com
jfsjhj#yahoo.com
jwrk#yahoo.com
rankw#yahoo.com
I have tried the suggestions from this answer: Replace all occurrences of a string in a pandas dataframe (Python). But it's not working. Any help will be appreciated.
edit:
What if some of the lists hold more than one value? Something like:
email
['jsaw#yahoo.com']
['jfsjhj#yahoo.com']
['jwrk#yahoo.com']
['rankw#yahoo.com','fsffsnl#gmail.com']
['mklcu#yahoo.com','riserk#gmail.com', 'funkdl#yahoo.com']
Is it possible to get the output in three different columns, without square brackets and single quotes?

You can use str.strip if the values are strings:
print type(df.at[0,'email'])
<type 'str'>
df['email'] = df.email.str.strip("[]'")
print df
email
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com
If the values are lists, apply Series:
print type(df.at[0,'email'])
<type 'list'>
df['email'] = df.email.apply(pd.Series)
print df
email
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com
EDIT: If you have multiple values in array, you can use:
df1 = df['email'].apply(pd.Series).fillna('')
print df1
0 1 2
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com fsffsnl#gmail.com
4 mklcu#yahoo.com riserk#gmail.com funkdl#yahoo.com
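The answer above is written for Python 2. A rough Python 3 equivalent for the list case is sketched below; it builds the wide frame directly from the lists instead of calling apply(pd.Series) row by row. The sample frame and the name wide are assumptions made for illustration.

```python
import pandas as pd

# Each cell holds a real Python list (not its string representation).
df = pd.DataFrame({"email": [
    ["jsaw#yahoo.com"],
    ["rankw#yahoo.com", "fsffsnl#gmail.com"],
    ["mklcu#yahoo.com", "riserk#gmail.com", "funkdl#yahoo.com"],
]})

# One column per list position; shorter lists are padded, then blanked out.
wide = pd.DataFrame(df["email"].tolist(), index=df.index).fillna("")
print(wide)
```

The number of columns follows the longest list, so the same line works whether rows hold one address or several.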

Try this one:
from re import findall
s = "['rankw#yahoo.com']"
m = findall(r"\[([A-Za-z0-9#'._]+)\]", s)
print(m[0].replace("'",''))
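If the cells are strings that merely look like Python lists, parsing them with ast.literal_eval from the standard library is an alternative to pattern matching. This is a sketch that assumes every cell is a well-formed list literal; malformed cells would raise an exception. The names parsed and first are illustrative.

```python
import ast
import pandas as pd

# Cells are strings such as "['a', 'b']", not actual lists.
df = pd.DataFrame({"email": [
    "['jsaw#yahoo.com']",
    "['rankw#yahoo.com', 'fsffsnl#gmail.com']",
]})

# literal_eval safely evaluates the literal and returns a real list.
parsed = df["email"].apply(ast.literal_eval)

# Take the first address from each parsed list.
first = parsed.apply(lambda items: items[0])
print(first.tolist())
```

Once the cells are real lists, they can be spread across columns in the same way as list-typed data.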

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
