How to extract the date from string and aggregate it? - python

I have a dataframe which has two features item_listing_time and item_sale_time.
They are strings and look like this:
item_listing_time              item_sale_time
2018-09-30T19:06:21.000-07:00  2018-09-30T23:06:21.000-07:00
I want to create a feature sold_in_24h which is True when sale happens in 24h.
At the moment my workflow looks like this:
# replacing "T" and "." char in item_listing_time and item_sale_time columns by space
data2['item_listing_time'] = data2['item_listing_time'].str.replace('T',' ')
data2['item_listing_time'] = data2['item_listing_time'].str.replace('.',' ')
data2['item_sale_time'] = data2['item_sale_time'].str.replace('T',' ')
data2['item_sale_time'] = data2['item_sale_time'].str.replace('.',' ')
# storing datetime into datetime_listing column as datetime type
data2['date_listing'] = data2.item_listing_time.str.split(' ').str[0]
data2['time_listing'] = data2.item_listing_time.str.split(' ').str[1]
data2['datetime_listing'] = data2.date_listing + " "+data2.time_listing
data2['datetime_listing'] = pd.to_datetime(data2['datetime_listing'], format='%Y-%m-%d %H:%M:%S')
# same with saletime
data2['date_sale'] = data2.item_sale_time.str.split(' ').str[0]
data2['time_sale'] = data2.item_sale_time.str.split(' ').str[1]
data2['saletime'] = data2.date_sale + " "+data2.time_sale
data2['saletime'] = pd.to_datetime(data2['saletime'], format='%Y-%m-%d %H:%M:%S')
# creating column for sold_in_24h
data2["was_sold_in_24h"] = (data2["saletime"] - data2["datetime_listing"]) < pd.Timedelta(days=1)
This method works, but I'm not sure it is a neat way to solve the problem.
Any suggestions on how to improve it, or should I leave it this way since it gives the desired result?
Thanks!

You can convert "item_listing_time" and "item_sale_time" directly to datetime:
df["item_listing_time"] = pd.to_datetime(df["item_listing_time"])
df["item_sale_time"] = pd.to_datetime(df["item_sale_time"])
one_day = pd.Timedelta(days=1)
df["sold_in_24h"] = df.apply(
    lambda x: x["item_listing_time"] + one_day > x["item_sale_time"], axis=1
)
print(df)
Prints:
          item_listing_time            item_sale_time  sold_in_24h
0 2018-09-30 19:06:21-07:00 2018-09-30 23:06:21-07:00         True
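The row-wise apply above is not strictly needed, because two datetime columns can be subtracted as whole columns. A minimal sketch of that variant, assuming the same column names as the question; the second sample row is made up for illustration:

```python
import pandas as pd

# Sample rows in the question's ISO 8601 format (the second row is hypothetical).
df = pd.DataFrame({
    "item_listing_time": ["2018-09-30T19:06:21.000-07:00", "2018-09-28T08:00:00.000-07:00"],
    "item_sale_time": ["2018-09-30T23:06:21.000-07:00", "2018-09-30T09:30:00.000-07:00"],
})

# The strings carry a UTC offset, so the parsed columns are timezone-aware.
listing = pd.to_datetime(df["item_listing_time"])
sale = pd.to_datetime(df["item_sale_time"])

# Subtracting two datetime columns gives a Timedelta column,
# which can be compared against a single Timedelta for every row at once.
df["sold_in_24h"] = (sale - listing) < pd.Timedelta(days=1)
print(df["sold_in_24h"])
```

On a large frame this usually runs much faster than apply with axis=1, since no Python function is called per row.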

For replacing the "T" and "." characters in the item_listing_time and item_sale_time columns with spaces:
The .str.replace calls can be chained, so each column needs only one statement:
data2['item_sale_time'] = data2['item_sale_time'].str.replace('.', ' ', regex=False).str.replace('T', ' ', regex=False)
For storing the datetime in the datetime_listing column as a datetime type:
Store the split result once and reuse it:
data = data2.item_listing_time.str.split(' ')
data2['date_listing'] = data.str[0]
data2['time_listing'] = data.str[1]
These are only basic tips; I would also be interested in more thorough suggestions.

Related

Trying to make a new column with existing column that have 'int' and 'string'

data = pd.read_csv(r'RE_absentee_one.csv')
data['New_addy'] = str(data['Prop-House Number']) + data['Prop-Street Name'] + data['Prop-Mode'] + str(data['Prop-Apt Unit Number'])
df = pd.DataFrame(data, columns = ['Name','New_addy'])
So this is the code.
As you can see, Prop-House Number and Prop-Apt Unit Number are both int and the rest are strings. I am trying to combine them all so that the full address ends up in one column labeled 'New_addy'.
Convert each column to a string with map(str) before concatenating, as shown below:
data = pd.read_csv(r'RE_absentee_one.csv')
data['New_addy'] = data['Prop-House Number'].map(str) + data['Prop-Street Name'].map(str) + data['Prop-Mode'].map(str) + data['Prop-Apt Unit Number'].map(str)
#select the desired columns for further work
data = data[['Name','New_addy']]
One way is using list comprehension:
data['New_addy'] = [str(n) + street + mode + str(apt_n) for n, street, mode, apt_n in zip(
    data['Prop-House Number'], data['Prop-Street Name'], data['Prop-Mode'], data['Prop-Apt Unit Number'])]
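Another possible variant casts all the address parts at once and joins them row by row. This is only a sketch: the sample values are invented, and the space separator is my assumption (the answers above concatenate with no separator).

```python
import pandas as pd

# Hypothetical rows; the column names follow the question, the values are made up.
data = pd.DataFrame({
    "Name": ["A. Smith", "B. Jones"],
    "Prop-House Number": [12, 340],
    "Prop-Street Name": ["Main", "Oak"],
    "Prop-Mode": ["St", "Ave"],
    "Prop-Apt Unit Number": [3, 21],
})

parts = ["Prop-House Number", "Prop-Street Name", "Prop-Mode", "Prop-Apt Unit Number"]

# Cast every part to str, then join the values of each row with a single space.
data["New_addy"] = data[parts].astype(str).apply(" ".join, axis=1)

# Keep only the columns needed for further work.
data = data[["Name", "New_addy"]]
print(data)
```

Listing the columns in one place makes it easy to add or drop an address part later.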

Concatenate on specific condition python

EDITED
I want to write an if statement with a condition for concatenating strings.
I.e. if cell A1 contains text in a specific format, then concatenate; otherwise leave it as is.
Example:
If the bill number looks like CM2/0000/, then concatenate this string with the date column (month - year); otherwise leave the bill number as it is.
Sample Data
You can create a function that does what you need and use df.apply() to execute it on all rows.
I use the example data from @Boomer's answer.
EDIT: you didn't show what you really have in the dataframe. It seems you have datetimes in bill_date, but I used strings, so I had to convert the strings to datetime to show how to work with them. The code now needs .strftime('%m-%y') on a single value, or .dt.strftime('%m-%y') on a whole column, instead of .str[3:].str.replace('/','-'). Pandas does not display datetimes in your day/month/year format, so I couldn't use str(x) for this: it gives me 2019-09-15 00:00:00 instead of your 15/09/19.
import pandas as pd
df = pd.DataFrame({
    'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
    'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
# the dates are day/month/year, so state the format explicitly
df['bill_date'] = pd.to_datetime(df['bill_date'], format='%d/%m/%y')
def convert(row):
    if row['bill_number'].endswith('/'):
        #return row['bill_number'] + row['bill_date'].str[3:].replace('/','-')
        return row['bill_number'] + row['bill_date'].strftime('%m-%y')
    else:
        return row['bill_number']
df['bill_number'] = df.apply(convert, axis=1)
print(df)
Result:
      bill_number  bill_date
0  CM2/0000/09-19 2019-09-15
1        CM2/0000 2019-09-15
2  CM3/0000/09-19 2019-09-15
3        CM3/0000 2019-09-15
The second idea is to create a mask
mask = df['bill_number'].str.endswith('/')
and later use it to update only the matching rows
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
The left side needs .loc[mask,'bill_number'] instead of [mask]['bill_number'] to assign the values correctly, but the right side doesn't need it.
import pandas as pd
df = pd.DataFrame({
    'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
    'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'], format='%d/%m/%y')
mask = df['bill_number'].str.endswith('/')
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
# or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
#or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
print(df)
The third idea is to use numpy.where()
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
    'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'], format='%d/%m/%y')
df['bill_number'] = np.where(
    df['bill_number'].str.endswith('/'),
    #df['bill_number'] + df['bill_date'].str[3:].str.replace('/','-'),
    df['bill_number'] + df['bill_date'].dt.strftime('%m-%y'),
    df['bill_number'])
print(df)
Maybe this will work for you. It would be nice to have a data sample, as @Mike67 said, but based on your information this is what I came up with. It's bulky, but it works. I'm sure someone else will have a fancier version.
import pandas as pd
dat = {'num': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
       'date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']}
df = pd.DataFrame(dat)
df['date'] = df['date'].map(lambda x: str(x)[3:])  # keep only "MM/YY"
df['date'] = df['date'].str.replace('/', '-')
# append the date only where the bill number ends with '/'
mask = df['num'].str.endswith('/')
df.loc[mask, 'num'] = df['num'] + df['date']
print(df)
Results:
              num   date
0  CM2/0000/09-19  09-19
1        CM2/0000  09-19
2  CM3/0000/09-19  09-19
3        CM3/0000  09-19
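The answers above test only for a trailing '/'. If "a specific format" is meant more strictly, the mask can be built from a regular expression instead. A rough sketch, where the pattern (CM, one digit, a slash, four digits, a trailing slash) is my guess at the format and the last sample row is made up to show a non-match:

```python
import pandas as pd

df = pd.DataFrame({
    "bill_number": ["CM2/0000/", "CM2/0000", "CM3/0000/", "XX/12/"],
    "bill_date": ["15/09/19", "15/09/19", "15/09/19", "15/09/19"],
})
df["bill_date"] = pd.to_datetime(df["bill_date"], format="%d/%m/%y")

# Assumed format: "CM", one digit, "/", four digits, trailing "/".
# fullmatch requires the whole string to fit the pattern.
mask = df["bill_number"].str.fullmatch(r"CM\d/\d{4}/")

# Append month-year only to the rows that match; the others stay untouched.
df.loc[mask, "bill_number"] = (
    df.loc[mask, "bill_number"] + df.loc[mask, "bill_date"].dt.strftime("%m-%y")
)
print(df)
```

Here "XX/12/" ends with a slash but is left alone, which is the difference from the endswith('/') check.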

Formatting a specific row of integers to the ssn style

I want to format a specific column of integers to ssn format (xxx-xx-xxxx). I saw that openpyxl has builtin styles. I have been using pandas and wasn't sure if it could do this specific format.
I did see this -
df.iloc[:,:].str.replace(',', '')
but I want to replace the ',' with '-'.
import pandas as pd
df = pd.read_excel('C:/Python/Python37/Files/Original.xls')
df.drop(['StartDate', 'EndDate','EmployeeID'], axis = 1, inplace=True)
df.rename(columns={'CheckNumber': 'W/E Date', 'CheckBranch': 'Branch','DeductionAmount':'Amount'},inplace=True)
df = df[['Branch','Deduction','CheckDate','W/E Date','SSN','LastName','FirstName','Amount','Agency','CaseNumber']]
ssn = (df['SSN'] # the integer column
.astype(str) # cast integers to string
.str.zfill(8) # zero-padding
.pipe(lambda s: s.str[:2] + '-' + s.str[2:4] + '-' + s.str[4:]))
writer = pd.ExcelWriter('C:/Python/Python37/Files/Deductions Report.xlsx')
df.to_excel(writer,'Sheet1')
writer.save()
Your question is a bit confusing; see if this helps.
If you have a column of integers and want to turn it into strings in SSN (Social Security Number) format, you can try something like:
df['SSN'] = (df['SSN']      # the "integer" column
             .astype(int)   # make sure the values really are integers
             .astype(str)   # cast integers to string
             .str.zfill(9)  # zero-pad to nine digits
             .pipe(lambda s: s.str[:3] + '-' + s.str[3:5] + '-' + s.str[5:]))
Setup
Social Security numbers are nine-digit numbers using the form: AAA-GG-SSSS
s = pd.Series([111223333, 222334444])
0    111223333
1    222334444
dtype: int64
Option 1
Using zip and numpy.unravel_index:
import numpy as np
pd.Series([
    '{}-{}-{}'.format(*el)
    for el in zip(*np.unravel_index(s, (1000,100,10000)))
])
Option 2
Using f-strings:
pd.Series([f'{i[:3]}-{i[3:5]}-{i[5:]}' for i in s.astype(str)])
Both produce:
0    111-22-3333
1    222-33-4444
dtype: object
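A further option, not from the answers above, is a single regex replacement with capture groups. A small sketch; the third value is made up to show what zero-padding does to a number with fewer than nine digits:

```python
import pandas as pd

# The first two values come from the setup above; the third is hypothetical.
s = pd.Series([111223333, 222334444, 1234567])

ssn = (
    s.astype(str)
     .str.zfill(9)  # restore leading zeros lost by the integer type
     # three groups of 3, 2 and 4 digits, re-joined with dashes
     .str.replace(r"(\d{3})(\d{2})(\d{4})", r"\1-\2-\3", regex=True)
)
print(ssn)
```

The zfill step matters here: without it, 1234567 would not match the nine-digit pattern and would be left unformatted.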
I prefer:
df["ssn"] = df["ssn"].astype(str)
df["ssn"] = df["ssn"].str.strip()
df["ssn"] = (
    df.ssn.str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.replace("-", "", regex=False)
    .str.replace(" ", "", regex=False)
    .apply(lambda x: f"{x[:3]}-{x[3:5]}-{x[5:]}")
)
This takes into account rows that are partially formatted, fully formatted, or not formatted, and formats them all the same way.
For Example:
data = [111111111,123456789,"222-11-3333","433-3131234"]
df = pd.DataFrame(data, columns=['ssn'])
This gives you the raw values before the code runs and the formatted values after it (the original Before and After screenshots are not available).
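One thing the cleanup above does not do is check the number of digits, and the last value in the example data actually contains ten. A possible check, sketched with the same example data; the ssn_valid column name is my own:

```python
import pandas as pd

data = [111111111, 123456789, "222-11-3333", "433-3131234"]
df = pd.DataFrame(data, columns=["ssn"])

# Drop every non-digit character in one pass instead of chaining replaces.
digits = df["ssn"].astype(str).str.replace(r"\D", "", regex=True)

# A well-formed SSN has exactly nine digits; flag anything else for review.
df["ssn_valid"] = digits.str.len() == 9
print(df)
```

Rows flagged False can then be inspected by hand rather than being silently formatted into something that only looks like an SSN.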

Splitting a string on multiple delimiters

Is there a way to split a string in a pandas dataframe column like
coordinates(gDNA)
chr10:g.89711916T>A
into tab-separated fields
chr\tstart\tref\talt
chr10\t89711916\tT\tA
in pandas.
So far I've tried
df[['chr','others']] = df['coordinates(gDNA)'].str.split(':',expand=True)
and extracted the first part, but I'm not sure what to do with the rest.
Use:
df[['chr','start', 'alt']] = df['coordinates(gDNA)'].str.split(r':g\.|>', expand=True)
df[['start','ref']] = df['start'].str.extract(r'(\d+)(\D+)')
print (df)
     coordinates(gDNA)    chr     start alt ref
0  chr10:g.89711916T>A  chr10  89711916   A   T
Try this:
df[['chr','start','ref','alt']] = df['coordinates(gDNA)'].str.extract(r'(\w+).*?(\d+)(\w+).*?(\w+)')
df = pd.DataFrame(
    columns=['coordinates(gDNA)'],
    data=[['chr10:g.89711916T>A']]
)
def parser(x):
    ch, x = x.split(':g.')
    start = int(x[:-3])
    ref = x[-3]
    alt = x[-1]
    return dict(chr=ch, start=start, ref=ref, alt=alt)
pd.DataFrame([*map(parser, df['coordinates(gDNA)'])], df.index)
  alt    chr ref     start
0   A  chr10   T  89711916
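The extract-based answers can also be written with named groups, so the column names come straight from the pattern. A sketch under the assumption that ref and alt are always a single base (A, C, G or T), as in the example; longer alleles would need a different pattern:

```python
import pandas as pd

df = pd.DataFrame({"coordinates(gDNA)": ["chr10:g.89711916T>A"]})

# Each named group becomes a column in the result of str.extract.
# Assumption: ref and alt are single bases.
pattern = r"(?P<chr>\w+):g\.(?P<start>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])"
parsed = df["coordinates(gDNA)"].str.extract(pattern)
parsed["start"] = parsed["start"].astype(int)

# Tab-separated text with the header chr, start, ref, alt.
tsv = parsed.to_csv(sep="\t", index=False)
print(tsv)
```

Rows that do not match the pattern come back as NaN in every column, which makes malformed coordinates easy to spot before the integer cast.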

Split and Join Series in Pandas

I have two series in the dataframe below. The first is a string which will appear in the second, which will be a url string. What I want to do is change the first series by concatenating on extra characters, and have that change applied onto the second string.
import pandas as pd
#import urlparse
d = {'OrigWord' : ['bunny', 'bear', 'bull'], 'WordinUrl' : ['http://www.animal.com/bunny/ear.html', 'http://www.animal.com/bear/ear.html', 'http://www.animal.com/bull/ear.html'] }
df = pd.DataFrame(d)
def trial(source_col, dest_col):
    splitter = dest_col.str.split(str(source_col))
    print(type(splitter))
    print(splitter)
    res = 'angry_' + str(source_col).join(splitter)
    return res
df['Final'] = df.applymap(trial(df.OrigWord, df.WordinUrl))
I'm trying to find the string from source_col, split on that string in dest_col, and then apply that change to the string in dest_col. Here I have it as a new series called Final, but I would rather do it in place. I think the main issues are the splitter variable, which isn't working, and the way the function is applied.
Here's how the result should look:
OrigWord     WordinUrl
angry_bunny  http://www.animal.com/angry_bunny/ear.html
angry_bear   http://www.animal.com/angry_bear/ear.html
angry_bull   http://www.animal.com/angry_bull/ear.html
applymap works on one element at a time, so it cannot combine two columns from the same row. What you can do is change your function so that it takes a row (a Series) instead, reads source_col and dest_col from that row, and is applied with apply(..., axis=1). One way of doing it is shown below:
def trial(x):
    source_col = x["OrigWord"]
    dest_col = x['WordinUrl']
    splitter = str(dest_col).split(str(source_col))
    res = splitter[0] + 'angry_' + source_col + splitter[1]
    return res
df['Final'] = df.apply(trial, axis=1)
Here is an alternative approach:
df['WordinUrl'] = (df.apply(lambda x: x.WordinUrl.replace(x.OrigWord,
                                                          'angry_' + x.OrigWord), axis=1))
In [25]: df
Out[25]:
  OrigWord                                   WordinUrl
0    bunny  http://www.animal.com/angry_bunny/ear.html
1     bear   http://www.animal.com/angry_bear/ear.html
2     bull   http://www.animal.com/angry_bull/ear.html
Instead of using split, you can use the replace method to prepend angry_ to the corresponding source:
def trial(row):
    row.WordinUrl = row.WordinUrl.replace(row.OrigWord, "angry_" + row.OrigWord)
    row.OrigWord = "angry_" + row.OrigWord
    return row
df.apply(trial, axis = 1)
      OrigWord                                   WordinUrl
0  angry_bunny  http://www.animal.com/angry_bunny/ear.html
1   angry_bear   http://www.animal.com/angry_bear/ear.html
2   angry_bull   http://www.animal.com/angry_bull/ear.html
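The same in-place result can be had without apply by zipping the two columns in a list comprehension. A sketch; matching on "/word/" instead of the bare word is my own precaution so that only the path segment is touched (for example, "bear" also contains "ear"):

```python
import pandas as pd

df = pd.DataFrame({
    "OrigWord": ["bunny", "bear", "bull"],
    "WordinUrl": [
        "http://www.animal.com/bunny/ear.html",
        "http://www.animal.com/bear/ear.html",
        "http://www.animal.com/bull/ear.html",
    ],
})

# String concatenation with a Series is vectorized.
new_word = "angry_" + df["OrigWord"]

# Replace the first "/word/" path segment of each URL with "/angry_word/".
df["WordinUrl"] = [
    url.replace(f"/{old}/", f"/{new}/", 1)
    for url, old, new in zip(df["WordinUrl"], df["OrigWord"], new_word)
]
df["OrigWord"] = new_word
print(df)
```

Both columns are overwritten here, which matches the expected output in the question rather than adding a separate Final column.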
