I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and replace the first five characters so it looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the values and replace only the first five digits with asterisks, keeping the hyphens as they are?
Similar to Mr. Me's answer, this replaces the first six characters (the five digits and the first hyphen) with the masked prefix and keeps the rest of the value:
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can achieve this with the replace() method.
Example dataframe (borrowed from #AkshayNevrekar):
>>> df
ssn
0 111-22-3333
1 121-22-1123
2 345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
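Put your asterisks in front, then grab the last 4 digits of each value with the .str accessor (a plain [-4:] on the Series would select the last four rows, not the last four characters):
new_ssn = '***-**-' + emp_pd["SSN"].str[-4:]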
You can use a regex:
import re
import pandas as pd
df = pd.DataFrame({'ssn':['111-22-3333','121-22-1123','345-87-3425']})
def func(x):
    # Replace the "ddd-dd" block with the masked prefix
    return re.sub(r'\d{3}-\d{2}','***-**', x)
df['ssn'] = df['ssn'].apply(func)
print(df)
Output:
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
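Since the question asks to mask only the digits and leave the hyphens alone, here is a minimal sketch of another regex variant. It masks every digit that still has a hyphen somewhere to its right, so it does not hard-code the positions. The sample values are made up for illustration, and it assumes a pandas version where str.replace accepts regex=True.

```python
import pandas as pd

# Hypothetical sample data standing in for the SSN column of the CSV file.
emp_pd = pd.DataFrame({"SSN": ["289-31-9165", "111-22-3333"]})

# Mask each digit that is followed, anywhere later in the string, by a hyphen.
# The lookahead (?=.*-) consumes nothing, so the hyphens themselves are kept.
emp_pd["SSN"] = emp_pd["SSN"].str.replace(r"\d(?=.*-)", "*", regex=True)

print(emp_pd["SSN"].tolist())
```

The last group of digits has no hyphen after it, so it is left untouched.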
I have a subset of data (single column) we'll call ID:
ID
0 07-1401469
1 07-89556629
2 07-12187595
3 07-381962
4 07-99999085
The current format is (usually) YY-[up to 8-character ID].
The desired output format is a more uniform YYYY-xxxxxxxx:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
Knowing that I've done padding in the past, the thought process was to combine
df['id'].str.split('-').str[0].apply(lambda x: '{0:20>4}'.format(x))
df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x))
However I ran into a few problems:
The fill in '{0:20>4}' must be a single character, so '20' is not accepted.
Trying something like the line below just results in df['id'] taking the result of the last lambda, and every other way I tried to combine multiple apply/lambda calls didn't work. I started going down the pad left/right route, but that seemed to be taking me backwards.
df['id'] = (df['id'].str.split('-').str[0].apply(lambda x: '{0:X>4}'.format(x)).str[1].apply(lambda x: '{0:0>8}'.format(x)))
The current solution I have (but HATE because it's long, messy, and just not clean IMO) is:
df['idyear'] = df['id'].str.split('-').str[0].apply(lambda x: '{:X>4}'.format(x)) # Split on '-' and pad with X
df['idyear'] = df['idyear'].str.replace('XX', '20') # Replace XX with 20 to conform to YYYY
df['idnum'] = df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x)) # Pad 0s up to 8 digits
df['id'] = df['idyear'].map(str) + "-" + df['idnum'] # Merge idyear and idnum to remake id
del df['idnum'] # delete extra
del df['idyear'] # delete extra
Which does work
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
But my questions are:
Is there a way to run multiple apply() functions in a single line so I'm not making temp variables?
Is there a better way than replacing 'XX' with '20'?
I feel like this entire code block can be compressed to 1 or 2 lines; I just don't know how. Everything I've seen on SO and in the pandas documentation so far only covers a single manipulation at a time.
One option is to split, then use str.zfill to pad with '0's. Also prepend '20' before splitting, since you seem to need it anyway:
tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-'+ tmp.str[1].str.zfill(8)
Output:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
I'd do it in two steps, using .str.replace:
df["ID"] = df["ID"].str.replace(r"^(\d{2})-", r"20\1-", regex=True)
df["ID"] = df["ID"].str.replace(r"-(\d+)", lambda g: f"-{g[1]:0>8}", regex=True)
print(df)
Prints:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
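The two replacements can also be folded into a single pass by giving str.replace a callable. This is only a sketch under the same assumption as the answers above, namely that every two-digit year belongs to the 2000s; the helper name normalize is made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"ID": ["07-1401469", "07-89556629", "07-12187595", "07-381962", "07-99999085"]}
)

def normalize(match):
    # Hypothetical helper: group 1 is the two-digit year, group 2 the numeric ID.
    year, number = match.group(1), match.group(2)
    # Assumes every year is 20YY; the number is left-padded with zeros to 8 digits.
    return f"20{year}-{number.zfill(8)}"

# One regex pass: values that do not match the YY-digits shape are left unchanged.
df["ID"] = df["ID"].str.replace(r"^(\d{2})-(\d+)$", normalize, regex=True)

print(df)
```

Because the pattern is anchored at both ends, rows that are already in another shape pass through untouched rather than being half-converted.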
I have two dictionaries:
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
and a table consisting of one single column where bond names are contained:
bond_names=pd.DataFrame({'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']})
I need to replace the name with a string of the following format: EUA21 where the first two letters are the corresponding value to the currency key in the dictionary, the next letter is the value corresponding to the month key and the last two digits are the year from the name.
I tried to split the name using the following code:
bond_names['Names']=bond_names['Names'].apply(lambda x: x.split('.'))
but I am not sure how to proceed from here to create the string, as I need to look up the currency and the month in the dictionaries at the same time, extract the values, join them, and append the year from the name.
This will give you a list of what you need:
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names = {'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']}
result = []
for names in bond_names['Names']:
    bond = names.split('.')
    result.append(currency[bond[1]] + time[bond[2]] + bond[3])
print(result)
You can do that like this:
import pandas as pd
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency = {'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names = pd.DataFrame({'Names': ['Bond.USD.JAN.21', 'Bond.USD.MAR.25', 'Bond.EUR.APR.22', 'Bond.HUF.JUN.21', 'Bond.HUF.JUL.23', 'Bond.GBP.JAN.21']})
bond_names['Names2'] = bond_names['Names'].apply(lambda x: currency[x[5:8]] + time[x[9:12]] + x[-2:])
print(bond_names['Names2'])
# 0 USA21
# 1 USC25
# 2 EUD22
# 3 HFF21
# 4 HFH23
# 5 GBA21
# Name: Names2, dtype: object
With extended regex substitution:
In [42]: bond_names['Names'].str.replace(
    ...:     r'^[^.]+\.([^.]+)\.([^.]+)\.(\d+)',
    ...:     lambda m: '{}{}{}'.format(currency.get(m.group(1), m.group(1)), time.get(m.group(2), m.group(2)), m.group(3)),
    ...:     regex=True)
Out[42]:
0 USA21
1 USC25
2 EUD22
3 HFF21
4 HFH23
5 GBA21
Name: Names, dtype: object
You can try this :
import pandas as pd
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names=pd.DataFrame({'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']})
# Split each name into its dot-separated parts
bond_names['Names'] = bond_names['Names'].apply(lambda x: x.split('.'))
for idx, bond in enumerate(bond_names['Names']):
    currencyID = currency.get(bond[1])
    monthID = time.get(bond[2])
    yearID = bond[3]
    # .loc avoids chained assignment, which may not update the original frame
    bond_names.loc[idx, 'Names'] = currencyID + monthID + yearID
Output
Names
0 USA21
1 USC25
2 EUD22
3 HFF21
4 HFH23
5 GBA21
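For completeness, here is a sketch of a vectorized variant that avoids both the Python loop and apply: split the names into columns once, then look up each part with Series.map. The dictionaries and names are the ones from the question; the variable names parts and codes are just illustrative.

```python
import pandas as pd

time = {'JAN': 'A', 'FEB': 'B', 'MAR': 'C', 'APR': 'D', 'MAY': 'E', 'JUN': 'F', 'JUL': 'H'}
currency = {'USD': 'US', 'EUR': 'EU', 'GBP': 'GB', 'HUF': 'HF'}
bond_names = pd.DataFrame({'Names': ['Bond.USD.JAN.21', 'Bond.USD.MAR.25', 'Bond.EUR.APR.22',
                                     'Bond.HUF.JUN.21', 'Bond.HUF.JUL.23', 'Bond.GBP.JAN.21']})

# expand=True returns one column per dot-separated part (labelled 0, 1, 2, 3).
parts = bond_names['Names'].str.split('.', expand=True)

# map() looks every value up in the dictionary; the year part is appended as-is.
codes = parts[1].map(currency) + parts[2].map(time) + parts[3]

print(codes.tolist())
```

Note that map() yields NaN for a currency or month that is missing from the dictionaries, so unknown codes show up as missing values instead of raising a KeyError.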
How do I separate two numbers when their combined length is 18? What I want to do is separate the mobile number (10 digits) and the landline number (8 digits) when they are joined (18 digits).
I have tried to extract the first 8 digits, but I don't know how to apply the condition. I also need to remove those first 8 digits if the condition is satisfied.
df['Landline'] = df['Number'].str[:8]
I have tried this, but I know it's wrong:
df['Landline'] = df['Number'].apply(lambda x : x.str[:8] if len(x)==18 )
To extract the first 8 digits, use findall:
df['Number'].str.findall(r'^\d{8}')
Solution using an example
Here we use the Dummy Data made in the following section.
# separate landline and mobile numbers
phone_numbers = df.Numbers.str.findall(r'(^\d{8})*(\d{10})').tolist()
# store in a dict
d = dict((i, {'Landline': e[0][0], 'Mobile': e[0][1]}) for i, e in enumerate(phone_numbers))
# make a temporary dataframe
phone_df = pd.DataFrame(d).T
# update original dataframe
df['Landline'] = phone_df['Landline']
df['Mobile'] = phone_df['Mobile']
print(df)
Output:
Numbers Landline Mobile
0 123456780123456789 12345678 0123456789
1 0123456789 0123456789
Dummy Data
df = pd.DataFrame({'Numbers': ['123456780123456789', '0123456789', ]})
print(df)
Output:
Numbers
0 123456780123456789
1 0123456789
Looks like you need
df['Landline'] = df['Number'].apply(lambda x : x[:8] if len(x)==18 else x)
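Another way to get both columns at once is str.extract with named groups, where the landline part is optional. This is a sketch that assumes each value is either a bare 10-digit mobile number or an 8-digit landline followed by a 10-digit mobile number; it reuses the dummy data and the Numbers column name from the answer above.

```python
import pandas as pd

df = pd.DataFrame({'Numbers': ['123456780123456789', '0123456789']})

# The named groups become the column names of the extracted frame.
# The landline group is optional, so a bare 10-digit value still matches.
pattern = r'^(?P<Landline>\d{8})?(?P<Mobile>\d{10})$'
parts = df['Numbers'].str.extract(pattern)

# Attach the two new columns to the original frame.
df = df.join(parts)
print(df)
```

Rows without a landline get NaN in the Landline column, and values that fit neither shape get NaN in both columns.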
I have this sample data in a cell:
EmployeeID
2016-CT-1028
2016-CT-1028
2017-CT-1063
2017-CT-1063
2015-CT-948
2015-CT-948
So, my problem is: how can I add a 0 inside a value such as 2015-CT-948 to make it look like 2015-CT-0948?
I tried this code:
pattern = re.compile(r'(\d\d+)-(\w\w)-(\d\d\d)')
newlist = list(filter(pattern.match, idList))
The idea was to get the entries that match the regex pattern and then add the 0 with zfill(), but it's not working. Can someone give me an idea of how to do it? Is there any way I can do it with regex or in pandas? Thank you!
This is one approach using zfill
Ex:
import pandas as pd
def custZfill(val):
    val = val.split("-")
    # alternative: split on the last "-" only
    # val = val.rsplit("-", 1)
    val[-1] = val[-1].zfill(4)
    return "-".join(val)
df = pd.DataFrame({"EmployeeID": ["2016-CT-1028", "2016-CT-1028",
"2017-CT-1063", "2017-CT-1063",
"2015-CT-948", "2015-CT-948"]})
print(df["EmployeeID"].apply(custZfill))
Output:
0 2016-CT-1028
1 2016-CT-1028
2 2017-CT-1063
3 2017-CT-1063
4 2015-CT-0948
5 2015-CT-0948
Name: EmployeeID, dtype: object
With pandas it can be solved with split instead of regex:
df['EmployeeID'].apply(lambda x: '-'.join(x.split('-')[:-1] + [x.split('-')[-1].zfill(4)]))
In pandas, you could use str.replace
df['EmployeeID'] = df.EmployeeID.str.replace(r'-(\d{3})$', r'-0\1', regex=True)
# Output:
0 2016-CT-1028
1 2016-CT-1028
2 2017-CT-1063
3 2017-CT-1063
4 2015-CT-0948
5 2015-CT-0948
Name: EmployeeID, dtype: object
If the format of the IDs is strictly defined, you can also use a simple list comprehension to do this job:
ids = [
'2017-CT-1063',
'2015-CT-948',
'2015-CT-948'
]
new_ids = [id if len(id) == 12 else id[0:8]+'0'+id[8:] for id in ids]
print(new_ids)
# ['2017-CT-1063', '2015-CT-0948', '2015-CT-0948']
Here's a one liner:
df['EmployeeID'].apply(lambda x: '-'.join(xi if i != 2 else '%04d' % int(xi) for i, xi in enumerate(x.split('-'))))
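If the trailing number can be shorter than three digits, a callable replacement combined with zfill covers every width in one go. A minimal sketch, assuming the number to pad is always the run of digits at the end of the ID:

```python
import pandas as pd

df = pd.DataFrame({"EmployeeID": ["2016-CT-1028", "2017-CT-1063", "2015-CT-948"]})

# Match the trailing run of digits and left-pad it with zeros to 4 characters.
# IDs whose last part already has 4 or more digits come back unchanged.
df["EmployeeID"] = df["EmployeeID"].str.replace(
    r"\d+$", lambda m: m.group(0).zfill(4), regex=True
)

print(df["EmployeeID"].tolist())
```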
I have a pandas DataFrame (Python 2.7) with a column that looks something like this:
email
['jsaw#yahoo.com']
['jfsjhj#yahoo.com']
['jwrk#yahoo.com']
['rankw#yahoo.com']
I want to extract the email from it, without the square brackets and single quotes. The output should look like this:
email
jsaw#yahoo.com
jfsjhj#yahoo.com
jwrk#yahoo.com
rankw#yahoo.com
I have tried the suggestions from this answer: Replace all occurrences of a string in a pandas dataframe (Python). But it's not working. Any help will be appreciated.
Edit:
What if I have lists with more than one element? Something like:
email
['jsaw#yahoo.com']
['jfsjhj#yahoo.com']
['jwrk#yahoo.com']
['rankw#yahoo.com','fsffsnl#gmail.com']
['mklcu#yahoo.com','riserk#gmail.com', 'funkdl#yahoo.com']
Is it possible to get the output in three different columns, without square brackets and single quotes?
You can use str.strip if the values are strings:
print type(df.at[0,'email'])
<type 'str'>
df['email'] = df.email.str.strip("[]'")
print df
email
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com
If the values are lists, apply Series:
print type(df.at[0,'email'])
<type 'list'>
df['email'] = df.email.apply(pd.Series)
print df
email
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com
EDIT: If you have multiple values in array, you can use:
df1 = df['email'].apply(pd.Series).fillna('')
print df1
                  0                  1                 2
0    jsaw#yahoo.com
1  jfsjhj#yahoo.com
2    jwrk#yahoo.com
3   rankw#yahoo.com  fsffsnl#gmail.com
4   mklcu#yahoo.com   riserk#gmail.com  funkdl#yahoo.com
Try this one:
from re import findall
s = "['rankw#yahoo.com']"
m = findall(r"\[([A-Za-z0-9#'._]+)\]", s)
print(m[0].replace("'",''))
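If the cells are strings that merely look like Python lists, parsing them with ast.literal_eval and then spreading the lists into columns handles the multi-value case without a hand-written regex. This is a sketch that assumes Python 3 and a recent pandas (the question uses 2.7), and that every cell is a valid list literal; the names lists and wide are illustrative.

```python
import ast

import pandas as pd

df = pd.DataFrame({"email": [
    "['jsaw#yahoo.com']",
    "['rankw#yahoo.com','fsffsnl#gmail.com']",
    "['mklcu#yahoo.com','riserk#gmail.com', 'funkdl#yahoo.com']",
]})

# Turn each string such as "['a', 'b']" into a real Python list.
lists = df["email"].apply(ast.literal_eval)

# One column per list position; shorter lists are padded, then blanked out.
wide = pd.DataFrame(lists.tolist(), index=df.index).fillna("")

print(wide)
```

If the column already holds real lists, the literal_eval step can be skipped and the DataFrame built from df["email"].tolist() directly.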