Formatting a specific column of integers to the SSN format - python

I want to format a specific column of integers to ssn format (xxx-xx-xxxx). I saw that openpyxl has builtin styles. I have been using pandas and wasn't sure if it could do this specific format.
I did see this -
df.iloc[:,:].str.replace(',', '')
but I want to replace the ',' with '-'.
import pandas as pd
df = pd.read_excel('C:/Python/Python37/Files/Original.xls')
df.drop(['StartDate', 'EndDate','EmployeeID'], axis = 1, inplace=True)
df.rename(columns={'CheckNumber': 'W/E Date', 'CheckBranch': 'Branch','DeductionAmount':'Amount'},inplace=True)
df = df[['Branch','Deduction','CheckDate','W/E Date','SSN','LastName','FirstName','Amount','Agency','CaseNumber']]
ssn = (df['SSN']           # the integer column
       .astype(str)        # cast integers to string
       .str.zfill(8)       # zero-padding
       .pipe(lambda s: s.str[:2] + '-' + s.str[2:4] + '-' + s.str[4:]))
writer = pd.ExcelWriter('C:/Python/Python37/Files/Deductions Report.xlsx')
df.to_excel(writer,'Sheet1')
writer.save()

Your question is a bit confusing; see if this helps.
If you have a column of integers and you want to create a new one made up of strings in SSN (Social Security Number) format, you can try something like:
df['SSN'] = (df['SSN']      # the "integer" column
             .astype(int)   # make sure the values really are integers
             .astype(str)   # cast integers to string
             .str.zfill(9)  # zero-pad to nine digits
             .pipe(lambda s: s.str[:3] + '-' + s.str[3:5] + '-' + s.str[5:]))
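For instance, on a small made-up sample (the two values below are illustrative, not taken from the question's file), the zfill(9) step is what keeps SSNs that start with a zero intact:

import pandas as pd

df = pd.DataFrame({'SSN': [111223333, 12345678]})   # hypothetical sample values

df['SSN'] = (df['SSN']
             .astype(int)
             .astype(str)
             .str.zfill(9)
             .pipe(lambda s: s.str[:3] + '-' + s.str[3:5] + '-' + s.str[5:]))

print(df['SSN'].tolist())   # should print ['111-22-3333', '012-34-5678']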

Setup
Social Security numbers are nine-digit numbers using the form: AAA-GG-SSSS
s = pd.Series([111223333, 222334444])
0 111223333
1 222334444
dtype: int64
Option 1
Using zip and numpy.unravel_index:
import numpy as np

pd.Series([
    '{}-{}-{}'.format(*el)
    for el in zip(*np.unravel_index(s, (1000, 100, 10000)))
])
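What makes Option 1 work is that the shape (1000, 100, 10000) decomposes each integer the same way a flat index is decomposed into a 3-D position, so the three pieces come out as the AAA, GG and SSSS groups. A quick way to convince yourself (a worked check, not part of the original answer):

np.unravel_index(111223333, (1000, 100, 10000))   # -> (111, 22, 3333)

Note that the plain '{}-{}-{}' format drops leading zeros inside a group; if that matters for your data, a width-padded format string such as '{:03d}-{:02d}-{:04d}' keeps them.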
Option 2
Using f-strings:
pd.Series([f'{i[:3]}-{i[3:5]}-{i[5:]}' for i in s.astype(str)])
Both produce:
0 111-22-3333
1 222-33-4444
dtype: object

I prefer:
df["ssn"] = df["ssn"].astype(str)
df["ssn"] = df["ssn"].str.strip()
df["ssn"] = (
df.ssn.str.replace("(", "")
.str.replace(")", "")
.str.replace("-", "")
.str.replace(" ", "")
.apply(lambda x: f"{x[:3]}-{x[3:5]}-{x[5:]}")
)
This takes into account rows that are partially formatted, fully formatted, or not formatted at all, and correctly formats each of them.
For Example:
data = [111111111,123456789,"222-11-3333","433-3131234"]
df = pd.DataFrame(data, columns=['ssn'])
Before running the code, the ssn column is a mix of plain integers and partially formatted strings; after running it, every value comes out in xxx-xx-xxxx form (the original post illustrated this with before/after screenshots).
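As a quick check, this is what the snippet should produce on that sample (expected output, not copied from the original post's screenshots):

import pandas as pd

data = [111111111, 123456789, "222-11-3333", "433-3131234"]
df = pd.DataFrame(data, columns=['ssn'])

df["ssn"] = df["ssn"].astype(str).str.strip()
df["ssn"] = (
    df.ssn.str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.replace("-", "", regex=False)
    .str.replace(" ", "", regex=False)
    .apply(lambda x: f"{x[:3]}-{x[3:5]}-{x[5:]}")
)
print(df["ssn"].tolist())
# should print ['111-11-1111', '123-45-6789', '222-11-3333', '433-31-31234']
# (the last value had ten digits to begin with, so it cannot come out as a valid nine-digit SSN)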

Related

Split One Column to Multiple Columns in Pandas

I want to split one existing column into 3 columns. In the screenshot we see the builder column, which needs to be split into 3 more columns such as builder name, city and country. I use the str.split() method in Python to split the column, which gives me a good result for 2 columns:
ownerName = df['owner_name']
df[["ownername", "owner_country"]] = df["owner_name"].str.split("-", expand=True)
But when it comes to three columns, where I use the 2 delimiters ',' and '-':
ownerName = df['owner_name']
df[["ownername", "city", "owner_country"]] = df["owner_name"].str.split(",", "-", expand=True)
it gives me this error:
File "C:\Users....\lib\site-packages\pandas\core\frame.py", line 3160, in setitem
self._setitem_array(key, value)
File "C:\Users....\lib\site-packages\pandas\core\frame.py", line 3189, in _setitem_array
raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
What's the best solution for the 2 delimiters ',' and '-'? Also, there are some empty rows too.
Your exact input is unclear, but assuming the sample input kindly provided by @ArchAngelPwn, you could use str.split with a regex:
names = ['Builder_Name', 'City_Name', 'Country']
out = (df['Column1']
       .str.split(r'\s*[,-]\s*', expand=True)    # split on "," or "-" with optional spaces
       .rename(columns=dict(enumerate(names)))   # rename 0/1/2 with names in order
       )
output:
Builder_Name City_Name Country
0 Builder Name City Country
You can combine some of these steps if you feel you need to, but this is a possible option and should be pretty readable for most developers involved in the project:
data = {
    'Column1': ['Builder Name - City, Country']
}
df = pd.DataFrame(data)
df['Builder_Name'] = df['Column1'].apply(lambda x: x.split('-')[0])     # text before the "-"
df['City_Name'] = df['Column1'].apply(lambda x: x.split('-')[1:])       # everything after the "-", as a list
df['City_Name'] = df['City_Name'][0]                                    # unwrap the one-element list back into the column
df['City_Name'] = df['City_Name'].apply(lambda x: x.split()[0])         # first word, e.g. "City,"
df['City_Name'] = df['City_Name'].apply(lambda x: x.replace(',', ''))   # drop the trailing comma
df['Country'] = df['Column1'].apply(lambda x: x.split(',')[1])          # text after the ","
df = df[['Builder_Name', 'City_Name', 'Country']]
df
As mentioned in the question, there are 2 delimiters, "-" and ",". For one delimiter we simply use str.split("-", expand=True); for 2 different delimiters we can use the same code with a small addition. For example, with column1 = name - city name, country (Owner = SANTIERUL NAVAL CONSTANTA - CONSTANTZA, ROMANIA) the code would be written as:
df[["Owner_name", "City_Name", "owner_country"]] = df["owner_name"].str.split(r', |- |\*|\n', expand=True)
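A minimal sketch of that last variant on the single example value quoted above (the DataFrame construction here is made up for illustration):

import pandas as pd

df = pd.DataFrame({'owner_name': ['SANTIERUL NAVAL CONSTANTA - CONSTANTZA, ROMANIA']})
df[["Owner_name", "City_Name", "owner_country"]] = (
    df["owner_name"].str.split(r', |- |\*|\n', expand=True)
)
print(df[["Owner_name", "City_Name", "owner_country"]].iloc[0].tolist())
# should print ['SANTIERUL NAVAL CONSTANTA ', 'CONSTANTZA', 'ROMANIA']
# (only '- ' is consumed by the pattern, so the first piece keeps its trailing space)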

Trying to make a new column from existing columns that have 'int' and 'string'

data = pd.read_csv(r'RE_absentee_one.csv')
data['New_addy'] = str(data['Prop-House Number']) + data['Prop-Street Name'] + data['Prop-Mode'] + str(data['Prop-Apt Unit Number'])
df = pd.DataFrame(data, columns = ['Name','New_addy'])
So this is the code. As you can see, Prop-House Number and Prop-Apt Unit Number are both int, and the rest are strings. I am trying to combine all of these so that the full address is under one column labeled 'New_addy'.
Convert each column to string with map before concatenating, as shown below:
data = pd.read_csv(r'RE_absentee_one.csv')
data['New_addy'] = data['Prop-House Number'].map(str) + data['Prop-Street Name'].map(str) + data['Prop-Mode'].map(str) + data['Prop-Apt Unit Number'].map(str)
#select the desired columns for further work
data = data[['Name','New_addy']]
One way is using list comprehension:
data['New_addy'] = [str(n) + street + mode + str(apt_n) for n, street, mode, apt_n in zip(
    data['Prop-House Number'], data['Prop-Street Name'], data['Prop-Mode'], data['Prop-Apt Unit Number'])]
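If you want spaces between the address pieces, a small variation on the same idea is to convert just the numeric columns with astype(str) and concatenate the Series directly (the sample row below is hypothetical, standing in for RE_absentee_one.csv):

import pandas as pd

# hypothetical row standing in for the CSV contents
data = pd.DataFrame({
    'Name': ['Jane Doe'],
    'Prop-House Number': [123],
    'Prop-Street Name': ['Main St'],
    'Prop-Mode': ['Apt'],
    'Prop-Apt Unit Number': [4],
})

data['New_addy'] = (
    data['Prop-House Number'].astype(str) + ' '
    + data['Prop-Street Name'] + ' '
    + data['Prop-Mode'] + ' '
    + data['Prop-Apt Unit Number'].astype(str)
)
print(data['New_addy'][0])   # should print '123 Main St Apt 4'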

How to add prefix to the selected records in python pandas df

I have a df where some of the records in the column contain a prefix and some of them do not. I would like to update the records without the prefix. Unfortunately, my script adds the desired prefix to each record in df:
new_list = []
prefix = 'x'
for ids in df['ids']:
    if ids.find(prefix) < 1:
        new_list.append(prefix + ids)
How can I omit records with the prefix?
I've tried with df[df['ids'].str.contains(prefix)], but I'm getting an error.
Use Series.str.startswith for the mask and prepend the prefix with numpy.where:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ids': ['aaa', 'ssx', 'xwe']})
prefix = 'x'
df['ids'] = np.where(df['ids'].str.startswith(prefix), '', prefix) + df['ids']
print(df)
ids
0 xaaa
1 xssx
2 xwe
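An equivalent way to touch only the rows that are missing the prefix is a boolean mask with .loc (a sketch using the same sample frame as above):

import pandas as pd

df = pd.DataFrame({'ids': ['aaa', 'ssx', 'xwe']})
prefix = 'x'

mask = ~df['ids'].str.startswith(prefix)            # rows that do not already start with the prefix
df.loc[mask, 'ids'] = prefix + df.loc[mask, 'ids']  # prepend only where needed
print(df['ids'].tolist())                           # should print ['xaaa', 'xssx', 'xwe']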

How to extract the date from string and aggregate it?

I have a dataframe which has two features, item_listing_time and item_sale_time.
Those are strings and look like this:

item_listing_time                item_sale_time
2018-09-30T19:06:21.000-07:00    2018-09-30T23:06:21.000-07:00
I want to create a feature sold_in_24h which is True when the sale happens within 24 hours of listing.
At the moment my workflow looks like this:
# replacing "T" and "." char in item_listing_time and item_sale_time columns by space
data2['item_listing_time'] = data2['item_listing_time'].str.replace('T',' ')
data2['item_listing_time'] = data2['item_listing_time'].str.replace('.',' ')
data2['item_sale_time'] = data2['item_sale_time'].str.replace('T',' ')
data2['item_sale_time'] = data2['item_sale_time'].str.replace('.',' ')
# storing datetime into datetime_listing column as datetime type
data2['date_listing'] = data2.item_listing_time.str.split(' ').str[0]
data2['time_listing'] = data2.item_listing_time.str.split(' ').str[1]
data2['datetime_listing'] = data2.date_listing + " "+data2.time_listing
data2['datetime_listing'] = pd.to_datetime(data2['datetime_listing'], format='%Y-%m-%d %H:%M:%S')
# same with saletime
data2['date_sale'] = data2.item_sale_time.str.split(' ').str[0]
data2['time_sale'] = data2.item_sale_time.str.split(' ').str[1]
data2['saletime'] = data2.date_sale + " "+data2.time_sale
data2['saletime'] = pd.to_datetime(data2['saletime'], format='%Y-%m-%d %H:%M:%S')
# creating column for sold_in_24h
data2["was_sold_in_24h"] = (data2["saletime"] - data2["datetime_listing"]) < pd.Timedelta(days=1)
This method works, but I'm not sure it is a neat way to solve this problem.
Any opinions on how to improve it, or should I leave it this way, since it provides the desired result?
Thanks!
You can convert the "item_listing_time" and "item_sale_time" columns directly to datetime:
df["item_listing_time"] = pd.to_datetime(df["item_listing_time"])
df["item_sale_time"] = pd.to_datetime(df["item_sale_time"])
one_day = pd.Timedelta(days=1)
df["sold_in_24h"] = df.apply(
lambda x: x["item_listing_time"] + one_day > x["item_sale_time"], axis=1
)
print(df)
Prints:
item_listing_time item_sale_time sold_in_24h
0 2018-09-30 19:06:21-07:00 2018-09-30 23:06:21-07:00 True
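The row-wise apply can also be replaced with a plain vectorized comparison on the two datetime columns, which should give the same result without a Python-level loop (a sketch using the single sample row from the question):

import pandas as pd

df = pd.DataFrame({
    "item_listing_time": ["2018-09-30T19:06:21.000-07:00"],
    "item_sale_time": ["2018-09-30T23:06:21.000-07:00"],
})
df["item_listing_time"] = pd.to_datetime(df["item_listing_time"])
df["item_sale_time"] = pd.to_datetime(df["item_sale_time"])

# same comparison as the apply version, done column-wise
df["sold_in_24h"] = (df["item_sale_time"] - df["item_listing_time"]) < pd.Timedelta(days=1)
print(df["sold_in_24h"].tolist())   # should print [True]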
For replacing "T" and "." char in item_listing_time and item_sale_time columns by space:
One option you can use is that, replace calls can be chained together for single case, it will be better like this:
data2['item_sale_time'] = data2['item_sale_time'].str.replace('.',' ').replace('T',' ')
For storing the datetime into the datetime_listing column as a datetime type:
Store the split result and reuse it; splitting once with expand=True gives the date and time pieces as columns:
parts = data2.item_listing_time.str.split(' ', expand=True)
data2['date_listing'] = parts[0]
data2['time_listing'] = parts[1]
But these are very basic tips; I will also be looking out for more deeply thought-out suggestions here!

Split and Join Series in Pandas

I have two series in the dataframe below. The first is a string which will appear in the second, which will be a url string. What I want to do is change the first series by concatenating on extra characters, and have that change applied onto the second string.
import pandas as pd
#import urlparse
d = {'OrigWord' : ['bunny', 'bear', 'bull'], 'WordinUrl' : ['http://www.animal.com/bunny/ear.html', 'http://www.animal.com/bear/ear.html', 'http://www.animal.com/bull/ear.html'] }
df = pd.DataFrame(d)
def trial(source_col, dest_col):
    splitter = dest_col.str.split(str(source_col))
    print type(splitter)
    print splitter
    res = 'angry_' + str(source_col).join(splitter)
    return res

df['Final'] = df.applymap(trial(df.OrigWord, df.WordinUrl))
I'm trying to find the string from source_col, split on that string in dest_col, and then apply that change to the string in dest_col. Here I have it as a new series called Final, but I would rather do it in place. I think the main issues are the splitter variable, which isn't working, and the application of the function.
Here's how the result should look:
OrigWord WordinUrl
angry_bunny http://www.animal.com/angry_bunny/ear.html
angry_bear http://www.animal.com/angry_bear/ear.html
angry_bull http://www.animal.com/angry_bull/ear.html
applymap applies a function element-wise, so it can't combine multiple columns in the same row. What you can do is change your function so that it takes in a row (a Series) instead, assign source_col and dest_col from the appropriate values in that Series, and call it with apply(..., axis=1). One way of doing it is as below:
def trial(x):
    source_col = x["OrigWord"]
    dest_col = x['WordinUrl']
    splitter = str(dest_col).split(str(source_col))
    res = splitter[0] + 'angry_' + source_col + splitter[1]
    return res

df['Final'] = df.apply(trial, axis=1)
Here is an alternative approach:
df['WordinUrl'] = (df.apply(lambda x: x.WordinUrl.replace(x.OrigWord,
                                                          'angry_' + x.OrigWord), axis=1))
In [25]: df
Out[25]:
OrigWord WordinUrl
0 bunny http://www.animal.com/angry_bunny/ear.html
1 bear http://www.animal.com/angry_bear/ear.html
2 bull http://www.animal.com/angry_bull/ear.html
Instead of using split, you can use the replace method to prepend the angry_ to the corresponding source:
def trial(row):
    row.WordinUrl = row.WordinUrl.replace(row.OrigWord, "angry_" + row.OrigWord)
    row.OrigWord = "angry_" + row.OrigWord
    return row

df.apply(trial, axis=1)
OrigWord WordinUrl
0 angry_bunny http://www.animal.com/angry_bunny/ear.html
1 angry_bear http://www.animal.com/angry_bear/ear.html
2 angry_bull http://www.animal.com/angry_bull/ear.html
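Note that df.apply(trial, axis=1) returns a new DataFrame rather than changing df itself, so if you want to keep the result (the question asked for the change in place), assign it back:
df = df.apply(trial, axis=1)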
