I have a dataframe with many timestamp columns. What I'm trying to do is get the lower (earlier) of two dates, but only if both columns are not null. For example:
Internal Review Imported Date Lower Date
1 2/9/2018 19:44
2 2/15/2018 1:20 2/13/2018 2:18 2/13/2018 2:18
3 2/7/2018 23:17 2/12/2018 9:34 2/7/2018 23:17
4 2/12/2018 9:25
5 2/1/2018 20:57 2/12/2018 9:24 2/1/2018 20:57
If I wanted the lower of Internal Review and Imported Date, rows one and four would not return any value, but the other rows would return the lower date because both columns contain dates. I know .min(axis=1) will return a date, but the columns can be null, which is the problem.
I tried copying something similar from here:
def business_days(start, end):
    mask = pd.notnull(start) & pd.notnull(end)
    start = start.values.astype('datetime64[D]')[mask]
    end = end.values.astype('datetime64[D]')[mask]
    result = np.empty(len(mask), dtype=float)
    result[mask] = np.busday_count(start, end)
    result[~mask] = np.nan
    return result
and tried
def GetLowestDays(col1, col2, df):
    df = df.copy()
    Start = col1.copy().notnull()
    End = col2.copy().notnull()
    Col3 = [Start, End].min(axis=1)
    return col3
But I simply get an "AttributeError: 'list' object has no attribute 'min'".
The following code should do the trick:
df['Lower Date'] = df[( df['Internal Review'].notnull() ) & ( df['Imported Date'].notnull() )][['Internal Review','Imported Date']].min(axis=1)
The new column will only be filled with the minimum where both dates are not null.
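A minimal, self-contained sketch of the same idea with made-up values (not the asker's actual data):

import pandas as pd

df = pd.DataFrame({
    'Internal Review': pd.to_datetime(['2018-02-15 01:20', None, '2018-02-01 20:57']),
    'Imported Date':   pd.to_datetime(['2018-02-13 02:18', '2018-02-09 19:44', '2018-02-12 09:24']),
})

# rows where either date is missing get no value (NaT); the others get the earlier date
both = df['Internal Review'].notnull() & df['Imported Date'].notnull()
df['Lower Date'] = df[both][['Internal Review', 'Imported Date']].min(axis=1)
print(df)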
I am trying to iterate through the rows of a pandas DataFrame to get data from one column of each row, and use that data to add new columns. The code is listed below, but it is VERY slow. Is there any way to do what I'm trying to do without iterating through the individual rows of the dataframe?
ctqparam = []
wwy = []
ww = []

for index, row in df.iterrows():
    date = str(row['Event_Start_Time'])
    day = int(date[8] + date[9])
    month = int(date[5] + date[6])
    total = 0
    for i in range(0, month - 1):
        total += months[i]
    total += day
    out = total // 7
    ww += [out]
    wwy += [str(date[0] + date[1] + date[2] + date[3])]

    val = str(row['TPRev'])
    out = ""
    for letter in val:
        if letter != '.':
            out += letter
    df.replace(to_replace=row['TPRev'], value=str(out), inplace=True)

    val = str(row['Subtest'])
    if val in ctqparam_dict.keys():
        ctqparam += [ctqparam_dict[val]]

# add WWY column, WW column, and correct data format of Test_Tape column
df.insert(0, column='Work_Week_Year', value=wwy)
df.insert(3, column='Work_Week', value=ww)
df.insert(4, column='ctqparam', value=ctqparam)
It's hard to say exactly what you're trying to do. However, if you're looping through rows, chances are there is a better way to do it.
For example, given a csv file that looks like this:
Event_Start_Time,TPRev,Subtest
4/12/19 06:00,"this. string. has dots.. in it.",{'A_Dict':'maybe?'}
6/10/19 04:27,"another stri.ng wi.th d.ots.",{'A_Dict':'aVal'}
You may want to:
Format Event_Start_Time as datetime.
Get the week number from Event_Start_Time.
Remove all the dots (.) from the strings in column TPRev.
Expand a dictionary contained in Subtest to its own column.
Without looping through the rows, consider doing things by column instead. Think of it as doing the operation on the first 'cell' of a column and having it replicate all the way down.
Code:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Event_Start_Time TPRev Subtest
0 4/12/19 06:00 this. string. has dots.. in it. {'A_Dict':'maybe?'}
1 6/10/19 04:27 another stri.ng wi.th d.ots. {'A_Dict':'aVal'}
# format 'Event_Start_Time' as datetime
df['Event_Start_Time'] = pd.to_datetime(df['Event_Start_Time'], format='%d/%m/%y %H:%M')
# get the week number from 'Event_Start_Time'
df['Week_Number'] = df['Event_Start_Time'].dt.isocalendar().week
# replace all '.' (periods) in the 'TPRev' column
df['TPRev'] = df['TPRev'].str.replace('.', '', regex=False)
# get a dictionary string out of column 'Subtest' and put into a new column
df = pd.concat([df.drop(['Subtest'], axis=1), df['Subtest'].map(eval).apply(pd.Series)], axis=1)
print(df)
Event_Start_Time TPRev Week_Number A_Dict
0 2019-12-04 06:00:00 this string has dots in it 49 maybe?
1 2019-10-06 04:27:00 another string with dots 40 aVal
print(df.info())
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Event_Start_Time 2 non-null datetime64[ns]
1 TPRev 2 non-null object
2 Week_Number 2 non-null UInt32
3 A_Dict 2 non-null object
dtypes: UInt32(1), datetime64[ns](1), object(2)
So you end up with a dataframe like this...
Event_Start_Time TPRev Week_Number A_Dict
0 2019-12-04 06:00:00 this string has dots in it 49 maybe?
1 2019-10-06 04:27:00 another string with dots 40 aVal
Obviously you'll probably want to do other things. Look at your data and make a list of what you want to do to each column, or what new columns you need. Don't worry about how for now; chances are it's possible and has been done before - you just need to find the existing method.
You might write down, for example, "get the difference in days between the current row and the row beneath it". Then search for how to do each piece of formatting or calculation you require. Break the problem down.
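Applied back to the original loop in the question, a column-wise rewrite might look roughly like this - a sketch only, assuming months holds days-per-month and ctqparam_dict is the lookup dictionary from the asker's script:

# parse once, then derive everything column-wise instead of row by row
df['Event_Start_Time'] = pd.to_datetime(df['Event_Start_Time'])
df['Work_Week_Year'] = df['Event_Start_Time'].dt.year.astype(str)
# if months holds days per month, the cumulative sum in the loop is just the day of year
df['Work_Week'] = df['Event_Start_Time'].dt.dayofyear // 7
# drop the dots without looping over characters
df['TPRev'] = df['TPRev'].astype(str).str.replace('.', '', regex=False)
# look every Subtest value up in the dictionary in one go
df['ctqparam'] = df['Subtest'].astype(str).map(ctqparam_dict)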
I have 2 dataframes, UsdBrlDSlice and indexesM. The first is on a daily basis and has an index in yyyy-mm-dd format; the second is on a monthly basis and has an index in yyyy-mm format.
Example of UsdBrlDSlice:
USDBRL
date
1994-01-03 331.2200
1994-01-04 336.4900
1994-01-05 341.8300
1994-01-06 347.2350
1994-01-07 352.7300
...
2020-10-05 5.6299
2020-10-06 5.5205
2020-10-07 5.6018
2020-10-08 5.6200
2020-10-09 5.5393
I need to insert a new column in UsdBrlDSlice, multiplying its USDBRL value by a specific column indexesM['c'], but matching the correct month of both indexes.
Something like Excel's VLOOKUP followed by a multiplication. Thanks.
I solved it by 1) creating a new year-month column in the first dataframe, and then 2) applying the map() function:
UsdBrlDSlice['y-m'] = UsdBrlDSlice.index.to_period('M')
UsdBrlDSlice['new col'] = UsdBrlDSlice['USDBRL'] * UsdBrlDSlice['y-m'].map(indexesM.set_index(indexesM.index)['c'])
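For the map() lookup to find a match, the index of indexesM has to be comparable to the 'y-m' periods. A minimal sketch with made-up numbers, assuming the monthly 'yyyy-mm' index is first converted to a PeriodIndex:

import pandas as pd

UsdBrlDSlice = pd.DataFrame(
    {'USDBRL': [5.6299, 5.5205, 5.6018]},
    index=pd.to_datetime(['2020-10-05', '2020-10-06', '2020-11-03']))

indexesM = pd.DataFrame({'c': [1.01, 1.02]}, index=['2020-10', '2020-11'])
# make the monthly index a PeriodIndex so it lines up with to_period('M') below
indexesM.index = pd.to_datetime(indexesM.index).to_period('M')

UsdBrlDSlice['y-m'] = UsdBrlDSlice.index.to_period('M')
UsdBrlDSlice['new col'] = UsdBrlDSlice['USDBRL'] * UsdBrlDSlice['y-m'].map(indexesM['c'])
print(UsdBrlDSlice)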
# Alternative approach: merge the monthly values onto the daily frame
UsdBrlDSliceTmp = UsdBrlDSlice.copy()
UsdBrlDSliceTmp['date_col'] = UsdBrlDSliceTmp.index.values
indexesMTmp = indexesM.copy()
indexesMTmp['date_col'] = indexesMTmp.index.values
# build a merge key from each index; note this keys on the month number only,
# so if indexesM spans several years the key should include the year as well
UsdBrlDSliceTmp['month'] = UsdBrlDSliceTmp['date_col'].apply(lambda x: x.month)
indexesMTmp['month'] = indexesMTmp['date_col'].apply(lambda x: x.month)
UsdBrlDSliceTmp = UsdBrlDSliceTmp.merge(indexesMTmp, on='month', how='left')
UsdBrlDSliceTmp['target'] = UsdBrlDSliceTmp['USDBRL'] * UsdBrlDSliceTmp['c']
# use .values so the assignment is positional rather than aligned on the index
UsdBrlDSlice['new_col'] = UsdBrlDSliceTmp['target'].values
I have a pandas dataframe that includes multiple columns of monthly finance data. I have a period input that is specified by the person running the program; it's currently just saved as period, as shown below in the code.
#coded into python
period = ?? (user adds this in from input screen)
I need to create another column of data that uses the input period number to perform a calculation of other columns.
So I'd like to create a new column 'calculation' that depends on the period input. For example, if a period of 1 was used, calc1 would be computed (with the math actually done); period = 2 - then calc2; period = 3 - then calc3. I only need one column calculated, depending on the period number, but I've included three examples of how it would work.
I can do this in SQL using CASE WHEN, using the input period to sum the columns I need:
select Account #,
'&Period' AS Period,
'&Year' AS YR,
case
When '&Period' = '1' then sum(d_cf+d_1)
when '&Period' = '2' then sum(d_cf+d_1+d_2)
when '&Period' = '3' then sum(d_cf+d_1+d_2+d_3)
I am unsure how to do this easily in Python (I'm a newer learner). Yes, I could create a new column with its calculation for every possible period (1-12) and then select only the one I need, but I'd like to learn a more efficient way.
Can you help, or point me in a better direction?
You could certainly do something like
df[['d_cf'] + [f'd_{i}' for i in range(1, period+1)]].sum(axis=1)
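For example, with hypothetical column names matching the SQL above (d_cf, d_1, d_2, ...), a period of 2 sums d_cf + d_1 + d_2 for every row:

import pandas as pd

df = pd.DataFrame({'d_cf': [1, 2], 'd_1': [10, 20], 'd_2': [100, 200], 'd_3': [1000, 2000]})
period = 2  # supplied by the user
df['calculation'] = df[['d_cf'] + [f'd_{i}' for i in range(1, period + 1)]].sum(axis=1)
print(df['calculation'])  # 111 and 222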
You can do this using a simple function in python:
def get_calculation(df, period=None):
    '''
    df = pandas data frame
    period = integer type
    '''
    if period == 1:
        return df.apply(lambda x: x['d_0'] + x['d_1'], axis=1)
    if period == 2:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'], axis=1)
    if period == 3:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'] + x['d_3'], axis=1)
new_df = get_calculation(df, period = 1)
Setup:
df = pd.DataFrame({'d_0':list(range(1,7)),
'd_1': list(range(10,70,10)),
'd_2':list(range(100,700,100)),
'd_3': list(range(1000,7000,1000))})
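With the setup above, period 1 adds d_0 and d_1 row-wise, so:

print(get_calculation(df, period=1).tolist())
# [11, 22, 33, 44, 55, 66]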
Setup:
import pandas as pd
ddict = {
'Year':['2018','2018','2018','2018','2018',],
'Account_Num':['1111','1122','1133','1144','1155'],
'd_cf':['1','2','3','4','5'],
}
data = pd.DataFrame(ddict)
Create a value calculator:
def get_calcs(period):
    # Convert the period to a string so its digits can be repeated
    s = str(period)
    # The repeat count is the period number plus one
    n = int(period) + 1
    # This repeats the period number (period + 1) times
    return ''.join([i * n for i in s])
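For instance, a period value of 3 gives the digit repeated four times:

print(get_calcs(3))  # '3333'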
Main function copies data frame, iterates through period values, and sets calculated values to the correct spot index-wise for each relevant column:
def process_data(data_frame=data, period_column='d_cf'):
    # Copy data_frame argument
    df = data_frame.copy(deep=True)
    # Run through each value in our period column
    for i in df[period_column].values.tolist():
        # Create a temporary column
        new_column = 'd_{}'.format(i)
        # Pass the period into our calculator; capture the result
        calculated_value = get_calcs(i)
        # Create a new column based on our period number
        df[new_column] = ''
        # Use indexing to place the calculated value into our desired location
        df.loc[df[period_column] == i, new_column] = calculated_value
    # Return the result
    return df
Start:
Year Account_Num d_cf
0 2018 1111 1
1 2018 1122 2
2 2018 1133 3
3 2018 1144 4
4 2018 1155 5
Result:
process_data(data)
   Year Account_Num d_cf d_1   d_2    d_3     d_4      d_5
0  2018        1111    1  11
1  2018        1122    2       222
2  2018        1133    3              3333
3  2018        1144    4                      44444
4  2018        1155    5                               555555
I have a dataframe of jobs for different people, with a start and end time for each job. I'd like to count, every four months, how many jobs each person is responsible for. I figured out a way to do it, but I'm sure it's tremendously inefficient (I'm new to pandas). It takes quite a while to compute when I run the code on my complete dataset (hundreds of people and jobs).
Here is what I have so far.
#create a data frame
import pandas as pd
import numpy as np
df = pd.DataFrame({'job': pd.Categorical(['job1', 'job2', 'job3', 'job4']),
                   'person': pd.Categorical(['p1', 'p1', 'p2', 'p2']),
                   'start': ['2015-01-01', '2015-06-01', '2015-01-01', '2016-01-01'],
                   'end': ['2015-07-01', '2015-12-31', '2016-03-01', '2016-12-31']})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
Which gives me a dataframe with parsed start and end dates. I then create a new dataset with:
bdate = min(df['start'])
edate = max(df['end'])
dates = pd.date_range(bdate, edate, freq='4MS')
people = sorted(set(list(df['person'])))
df2 = pd.DataFrame(np.zeros((len(dates), len(people))), index=dates, columns=people)
for d in pd.date_range(bdate, edate, freq='MS'):
    for p in people:
        contagem = df[(df['person'] == p) &
                      (df['start'] <= d) &
                      (df['end'] >= d)]
        pos = np.argmin(np.abs(dates - d))
        df2.iloc[pos][p] = len(contagem.index)
df2
And I get a table of job counts per four-month period and per person.
I'm sure there must be a better way of doing this without having to loop through all the dates and people. But how?
This answer assumes that each job-person combination is unique. It creates a series for every row with the value equal to the job and an index that expands the dates. Then it resamples every 4th month (which is not quarterly, but is what your solution describes) and counts the unique non-na occurrences.
def make_date_range(x):
    return pd.Series(index=pd.date_range(x.start.values[0], x.end.values[0], freq='M'), data=x.job.values[0])
# Iterate through each job person combo and make an entry for each month with the job as the value
df1 = df.groupby(['job', 'person']).apply(make_date_range).unstack('person')
# remove outer level from index
df1.index = df1.index.droplevel('job')
# resample each month counting only unique values
df1.resample('4MS').agg(lambda x: len(x[x.notnull()].unique()))
Output
person p1 p2
2015-01-01 1 1
2015-05-01 2 1
2015-09-01 1 1
2016-01-01 0 2
2016-05-01 0 1
2016-09-01 0 1
And here is a long one-line solution that iterates over every row, creates a new dataframe for each, stacks them all together via pd.concat, and then resamples.
pd.concat([pd.DataFrame(index = pd.date_range(tup.start, tup.end, freq='4MS'),
data=[[tup.job]],
columns=[tup.person]) for tup in df.itertuples()])\
.resample('4MS').count()
And another one that is faster:
df1 = pd.melt(df, id_vars=['job', 'person'], value_name='date').set_index('date')
# pd.Grouper replaces the older pd.TimeGrouper used when this answer was written
g = df1.groupby([pd.Grouper(freq='4MS'), 'person'])['job']
g.agg('nunique').unstack('person', fill_value=0)
Following a "chain" of rows and counting the consecutive months from a CSV file.
Currently I am reading a CSV file with 5 columns of interest (based on insurance policies):
CONTRACT_ID  START-DATE  END-DATE    CANCEL_FLAG  OLD_CON_ID
123456       2015-05-30  2016-05-30  0            8788
123457       2014-03-20  2015-03-20  0            12000
123458       2009-12-20  2010-12-20  0            NaN
...
I want to count the number of consecutive months a contract chain runs for.
Example: take the START_DATE from the contract at the "front" of the chain (the oldest contract) and the END_DATE from the end of the chain (the newest contract). The oldest contract is defined as either the one before a cancelled contract in a chain, or the one that has no OLD_CON_ID value.
Each row represents a contract, and OLD_CON_ID points to the previous contract's ID. The desired output is how many months the contract chain goes back until there is a gap (i.e. the customer didn't have a contract for a period of time). If there is nothing in that column, then that row is the first contract in its chain.
CANCEL_FLAG should also cut the chain, because a value of 1 means the contract was cancelled.
Current code counts the number of active contracts for each year by editing the dataframe like so:
df_contract = df_contract[
    (df_contract['START_DATE'] <= pd.to_datetime('2015-05-31')) &
    (df_contract['END_DATE'] >= pd.to_datetime('2015-05-31')) &
    (df_contract['CANCEL_FLAG'] == 0)
]
df_contract = df_contract[df_contract['CANCEL_FLAG'] == 0]
activecount = df_contract.count()
print(activecount['CONTRACT_ID'])
Here are the first 6 lines of code in which I create the dataframes and adjust the datetime values:
file_name = 'EXAMPLENAME.csv'
df = pd.read_csv(file_name)
df_contract = pd.read_csv(file_name)
df_CUSTOMERS = pd.read_csv(file_name)
df_contract['START_DATE'] = pd.to_datetime(df_contract['START_DATE'])
df_contract['END_DATE'] = pd.to_datetime(df_contract['END_DATE'])
Ideal output is something like:
FIRST_CONTRACT_ID  CHAIN_LENGTH  CON_MONTHS
1234567            5             60
1500001            1             4
800                10            180
Those data points would then be graphed.
EDIT2: CSV file changed, might be easier now. Question updated.
Not sure if I totally understand your requirement, but does something like this work?
df_contract['TOTAL_YEARS'] = (df_contract['END_DATE'] - df_contract['START_DATE']
                              ) / np.timedelta64(1, 'Y')
df_contract.loc[(df_contract['CANCEL_FLAG'] == 1) &
                (df_contract['TOTAL_YEARS'] > 1), 'TOTAL_YEARS'] = 1
After a lot of trial and error I got it working!
This finds the time difference between the first and last contracts in the chain and finds the length of the chain.
Not the cleanest code by far, but it works:
test = 'START_DATE'
df_short = df_policy[['OLD_CON_ID',test,'CONTRACT_ID']]
df_short.rename(columns={'OLD_CON_ID':'PID','CONTRACT_ID':'CID'},
inplace = True)
df_test = df_policy[['CONTRACT_ID','END_DATE']]
df_test.rename(columns={'CONTRACT_ID':'CID','END_DATE': 'PED'}, inplace = True)
df_copy1 = df_short.copy()
df_copy2 = df_short.copy()
df_copy2.rename(columns={'PID':'PPID','CID':'PID'}, inplace = True)
df_merge1 = pd.merge(df_short, df_copy2,
how='left',
on=['PID'])
df_merge1['START_DATE_y'].fillna(df_merge1['START_DATE_x'], inplace = True)
df_merge1.rename(columns={'START_DATE_x':'1_EFF','START_DATE_y':'2_EFF'}, inplace=True)
The copy, merge, fillna, and rename code is repeated for 5 merged dataframes, and then:
df_merged = pd.merge(df_merge5, df_test,
how='right',
on=['CID'])
df_merged['TOTAL_MONTHS'] = ((df_merged['PED'] - df_merged['6_EFF']
)/np.timedelta64(1,'M'))
df_merged4 = df_merged[
    (df_merged['PED'] >= pd.to_datetime('2015-07-06'))
]
df_merged4['CHAIN_LENGTH'] = df_merged4.drop(['PED','1_EFF','2_EFF','3_EFF','4_EFF','5_EFF'], axis=1).apply(lambda row: len(pd.unique(row)), axis=1) - 3
Hopefully my code is understood and will help someone in the future.
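For anyone revisiting this later, here is a minimal alternative sketch (not the approach above, and untested on the real data) that walks each chain with index lookups instead of repeated merges. It assumes the question's column names, dates already parsed with pd.to_datetime, unique CONTRACT_ID values, and that CONTRACT_ID and OLD_CON_ID are directly comparable:

import pandas as pd

info = df_contract.set_index('CONTRACT_ID')

def walk_chain(cid):
    # follow OLD_CON_ID links from contract cid back to the start of its chain
    length = 1
    current = cid
    while True:
        prev = info.at[current, 'OLD_CON_ID']
        # stop at a missing link, an unknown contract, or a cancelled predecessor
        if pd.isnull(prev) or prev not in info.index or info.at[prev, 'CANCEL_FLAG'] == 1:
            break
        current = prev
        length += 1
    # rough month count between the oldest start and the newest end
    months = (info.at[cid, 'END_DATE'] - info.at[current, 'START_DATE']).days / 30.44
    return pd.Series({'FIRST_CONTRACT_ID': current,
                      'CHAIN_LENGTH': length,
                      'CON_MONTHS': round(months)})

# apply only to the newest contract of each chain, i.e. contracts whose
# CONTRACT_ID never appears as another row's OLD_CON_ID
newest = df_contract.loc[~df_contract['CONTRACT_ID'].isin(df_contract['OLD_CON_ID'])]
result = newest['CONTRACT_ID'].apply(walk_chain)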