I have two columns of dates that need to be compared: date1 is a list of given dates, and dob is a random date (date of birth). I need to compare the month and day against a condition to create a flag. Sample data:
import pandas as pd

df_sample = pd.DataFrame({'date1': ('2015-01-15','2015-01-15','2015-03-15','2015-04-15','2015-05-15'),
                          'dob': ('1999-01-25','1987-12-12','1965-03-02','2000-08-02','1992-05-15')})
I created a function based on the condition below:
def eligible(date1, dob):
    if date1.month - dob.month == 0 and date1.day <= dob.day:
        return 'Y'
    elif date1.month - dob.month == 1 and date1.day > dob.day:
        return 'Y'
    else:
        return 'N'
I want to apply this function to the original df, which has more than 5M rows, so a for loop is not efficient. Is there any way to achieve this?
The datatype is date, not datetime.
I think you need numpy.where with conditions chained by | (or):
import numpy as np

df_sample['date1'] = pd.to_datetime(df_sample['date1'])
df_sample['dob'] = pd.to_datetime(df_sample['dob'])
months_diff = df_sample.date1.dt.month - df_sample.dob.dt.month
days_date1 = df_sample.date1.dt.day
days_dob = df_sample.dob.dt.day
m1 = (months_diff==0) & (days_date1 <= days_dob)
m2 = (months_diff==1) & (days_date1 > days_dob)
df_sample['out'] = np.where(m1 | m2, 'Y', 'N')
print (df_sample)
date1 dob out
0 2015-01-15 1999-01-25 Y
1 2015-01-15 1987-12-12 N
2 2015-03-15 1965-03-02 N
3 2015-04-15 2000-08-02 N
4 2015-05-15 1992-05-15 Y
Using datetime is certainly beneficial:
df_sample['dob'] = pd.to_datetime(df_sample['dob'])
df_sample['date1'] = pd.to_datetime(df_sample['date1'])
Once you have it, your formula can be literally applied to all rows:
df_sample['eligible'] = (
    ((df_sample.date1.dt.month == df_sample.dob.dt.month)
     & (df_sample.date1.dt.day <= df_sample.dob.dt.day))
    | ((df_sample.date1.dt.month - df_sample.dob.dt.month == 1)
       & (df_sample.date1.dt.day > df_sample.dob.dt.day))
)
The result is boolean (True/False), but you can easily convert it to "Y"/"N", if you want.
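For example, one simple way to do that conversion:
df_sample['eligible'] = df_sample['eligible'].map({True: 'Y', False: 'N'})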
Related
I'm cleaning a data set with 6 columns and just under 9k rows. As part of the clean-up I have to find zero/negative, repetitive, interpolated, and outlier values, defined as:
repetitive values - 3 subsequent values are equal to 6 decimal places; flag the first one
interpolated values - take a = row1_val - row2_val, b = row2_val - row3_val, c = row3_val - row4_val, etc. If a = b or b = c, etc., flag
outlier values - 1.1*peak < MW < 0.1*peak
Right now I am using for loops on the data frame to do the row comparisons and flag the values, put them into a new data frame, and replace them with 999999, but it takes FOREVER. I used the following code to find and replace the zero/negative values, but I can't seem to make it work for the multi-row checks in the for loop. Can anyone show me how this works?
zero/negative values:
df = (df.drop(data_columns, axis=1).join(df[data_columns].apply(pd.to_numeric, errors='coerce')))
Missing_Vals_df = df.loc[(df['A KW'].isnull()) | (df['A KVAR'].isnull()) | (df['B KW'].isnull()) | (df['B KVAR'].isnull()) | (df['C KW'].isnull()) | (df['C KVAR'].isnull())]
df = df.fillna(999999)
Loops:
for x in range(len(df)-2):
    for thing in data_columns:
        if df.loc[x][thing] <= 0:
            df = df.replace(to_replace=df.loc[x][thing], value=999999)
        elif (round(df.loc[x][thing], 6) == round(df.loc[x+1][thing], 6) == round(df.loc[x+2][thing], 6)) & (df.loc[x][thing] != 999999):
            if x not in duplicate_loc:
                duplicate_loc.append(x)
                duplicate_df = duplicate_df.append(df.loc[(x)])
            df = df.replace(to_replace=df.iloc[x][thing], value=999999)
        elif (round((df.loc[x+1][thing] - df.loc[x][thing]), 3) == round((df.loc[x+2][thing] - df.loc[x+1][thing]), 3)) & (df.loc[x][thing] != 999999):
            if x not in interpolated_loc:
                interpolated_loc.append(x)
                interpolated_df = interpolated_df.append(df.loc[(x)])
            df = df.replace(to_replace=df.iloc[x][thing], value=999999)
        elif ((df.loc[x][thing] > 1.1*df_peak.loc[0]['Value']) | (df.loc[x][thing] > 1.1*df_peak.loc[0]['Value']) | (df.loc[x][thing] > 1.1*df_peak.loc[0]['Value'])) & (df.loc[x][thing] != 999999):
            if x not in outlier_loc:
                outlier_loc.append(x)
                outlier_df = outlier_df.append(df.loc[(x)])
            df = df.replace(to_replace=df.iloc[x][thing], value=999999)
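A rough, vectorized sketch of those row-to-row checks (assuming data_columns holds your six numeric columns and peak is the peak value referred to above, neither of which is shown here; adjust the outlier rule to whatever definition you settle on):
# Sketch only: df, data_columns and peak are assumed to come from your own data.
for col in data_columns:
    s = df[col]

    # zero/negative values
    zero_neg = s <= 0

    # repetitive: this value and the next two match to 6 decimal places (flags the first)
    r = s.round(6)
    repetitive = (r == r.shift(-1)) & (r == r.shift(-2))

    # interpolated: consecutive differences are equal (rounded to 3 places, as in the loop)
    d = s.diff(-1).round(3)          # s[x] - s[x+1]
    interpolated = d == d.shift(-1)

    # outliers relative to the peak (adapt to your actual definition)
    outlier = s > 1.1 * peak

    flagged = zero_neg | repetitive | interpolated | outlier
    df.loc[flagged, col] = 999999
The boolean masks also give you the row positions directly (e.g. df.index[repetitive]), so you can build the duplicate/interpolated/outlier frames without appending row by row.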
I have a dataframe of elements with a start and end datetime. What is the best option to find intersections of the dates? My naive approach right now consists of two nested loops cross-comparing the elements, which obviously is super slow. What would be a better way to achieve that?
dict = {}
start = "start_time"
end = "end_time"

for index1, rowLoop1 in df[{start, end}].head(500).iterrows():
    matches = []
    dict[(index1, rowLoop1[start])] = 0
    for index2, rowLoop2 in df[{start, end}].head(500).iterrows():
        if index1 != index2:
            if date_intersection(rowLoop1[start], rowLoop1[end], rowLoop2[start], rowLoop2[end]):
                dict[(index1, rowLoop1[start])] += 1
Code for date_intersection:
def date_intersection(t1start, t1end, t2start, t2end):
    if (t1start <= t2start <= t2end <= t1end): return True
    elif (t1start <= t2start <= t1end): return True
    elif (t1start <= t2end <= t1end): return True
    elif (t2start <= t1start <= t1end <= t2end): return True
    else: return False
Sample data:
id,start_date,end_date
41234132,2021-01-10 10:00:05,2021-01-10 10:30:27
64564512,2021-01-10 10:10:00,2021-01-11 10:28:00
21135765,2021-01-12 12:30:00,2021-01-12 12:38:00
87643252,2021-01-12 12:17:00,2021-01-12 12:42:00
87641234,2021-01-12 12:58:00,2021-01-12 13:17:00
You can do something like merging your dataframe with itself to get the cartesian product and comparing columns.
df = df.merge(df, how='cross', suffixes=('','_2'))
df['date_intersection'] = (((df['start_date'].le(df['start_date_2']) & df['start_date_2'].le(df['end_date'])) | # start 2 within start/end
(df['start_date'].le(df['end_date_2']) & df['end_date_2'].le(df['end_date'])) | # end 2 within start/end
(df['start_date_2'].le(df['start_date']) & df['start_date'].le(df['end_date_2'])) | # start within start 2/end 2
(df['start_date_2'].le(df['end_date']) & df['end_date'].le(df['end_date_2']))) & # end within start 2/end 2
df['id'].ne(df['id_2'])) # id not compared to itself
and then, to return the ids and whether they have a date intersection...
df.groupby('id')['date_intersection'].any()
id
21135765 True
41234132 True
64564512 True
87641234 False
87643252 True
or if you need the ids that were intersected
df.loc[df['date_intersection'], :].groupby(['id'])['id_2'].agg(list).to_frame('intersected_ids')
intersected_ids
id
21135765 [87643252]
41234132 [64564512]
64564512 [41234132]
87643252 [21135765]
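As a side note, the four chained containment checks are equivalent to the standard interval-overlap test (two intervals overlap exactly when each one starts no later than the other ends), so the condition can be shortened, assuming start_date <= end_date within each row:
# Equivalent, shorter overlap test on the cross-merged frame
df['date_intersection'] = (df['start_date'].le(df['end_date_2']) &
                           df['start_date_2'].le(df['end_date']) &
                           df['id'].ne(df['id_2']))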
Hoping someone helps me out. I have a nested JSON file and I'm trying to calculate the age difference between two entries in the file, start_date and end_date, which use a mm/yyyy date format only. So I'm trying to split them so I can calculate the year difference between end_date and start_date; if it is over 10 years, I add the record to another list.
This is my code below, but it prints an empty list and I don't know how to fix it. Any tips or directions will be appreciated.
Oh... I have to use default Python libraries, so even though pandas would be easier, I can't use it.
remove_card = []

def datebreakdown(data_file):
    expr1 = data_file['Credit Card']['start_date']
    expr2 = data_file['Credit Card']['end_date']
    breakdown1 = expr1.split('/')
    breakdown2 = expr2.split('/')
    card_month = int(breakdown1[0]) - int(breakdown2[0])
    card_year = int(breakdown1[1]) - int(breakdown2[1])
    if card_year >= 10:
        return True
    elif card_year == 10 and card_year > 0:
        return True
    else:
        return False

for line in data_json:  # data_json is the name of the json file.
    if datebreakdown(data_file) == True:
        remove_card.append(data_file)
I think these are the conditions you want:
if card_year > 10:
    return True
elif card_year == 10 and card_month > 0:
    return True
else:
    return False
The first condition should be strictly >, not >=. The second condition should compare the months when the year difference is exactly 10.
Another problem is that you're subtracting the dates in the wrong order. You're subtracting the end from the start, so it will always be negative. So those subtractions should be:
card_month = int(breakdown2[0]) - int(breakdown1[0])
card_year = int(breakdown2[1]) - int(breakdown1[1])
Alternatively, you can convert each mm/yyyy date to a fractional year and compare directly:
def datebreakdown(data_file):
    expr1 = data_file['Credit Card']['start_date']
    expr2 = data_file['Credit Card']['end_date']
    month1, year1 = expr1.split('/')
    month2, year2 = expr2.split('/')
    start_date = int(year1) + int(month1)/12
    end_date = int(year2) + int(month2)/12
    return end_date - start_date > 10
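For example, with a hypothetical record in the mm/yyyy format described above:
sample = {'Credit Card': {'start_date': '05/2008', 'end_date': '07/2019'}}
print(datebreakdown(sample))   # about 11.2 years apart -> True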
I have a data frame in python containing the following information:
Day Type
Weekday 1
Weekday 2
Weekday 3
Weekday 1
Weekend 2
Weekend 1
I want to add a new column by generating a Weibull random number, but each pair of "Day" and "Type" has its own Weibull distribution.
For example, I have tried the following code but it did not work:
df['Duration'][ (df['Day'] == "Weekend") & (df['Type'] == 1) ] = int(random.weibullvariate(5.6/math.gamma(1+1/6),6))
df['Duration'] = df['Day','Type'].map(lambda x,y: int(random.weibullvariate(5.6/math.gamma(1+1/10),10)) if x == "Weekday" and y == 1 if x == "Weekend" and y == 1 int(random.weibullvariate(5.6/math.gamma(1+1/6),6)))
Define a function that generates the random number you want and apply it to the rows.
import io
import random
import math
import pandas as pd
data = io.StringIO('''\
Day Type
Weekday 1
Weekday 2
Weekday 3
Weekday 1
Weekend 2
Weekend 1
''')
df = pd.read_csv(data, delim_whitespace=True)
def duration(row):
    if row['Day'] == 'Weekend' and row['Type'] == 1:
        return int(random.weibullvariate(5.6/math.gamma(1+1/6), 6))
    if row['Day'] == 'Weekday' and row['Type'] == 1:
        return int(random.weibullvariate(5.6/math.gamma(1+1/10), 10))
df['Duration'] = df.apply(duration, axis=1)
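One caveat: rows whose Day/Type combination is not covered by the two if branches will get None in Duration. If you would rather have a default value, you could end the function with a fallback return, for example:
def duration(row):
    if row['Day'] == 'Weekend' and row['Type'] == 1:
        return int(random.weibullvariate(5.6/math.gamma(1+1/6), 6))
    if row['Day'] == 'Weekday' and row['Type'] == 1:
        return int(random.weibullvariate(5.6/math.gamma(1+1/10), 10))
    return 0  # default for combinations without their own distribution (pick whatever makes sense)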
I have a pandas DataFrame named Joined with 5 fields:
product | price | percentil_25 | percentil_50 | percentile_75
For each row I want to classify the price like this:
if the price is below percentil_25 the product gets class 1, and so on
So what I did is:
classe_final = OrderedDict()
classe_final['sku'] = []
classe_final['class'] = []

for index in range(len(joined)):
    classe_final['sku'].append(joined.values[index][0])
    if (float(joined.values[index][1]) <= float(joined.values[index][2])):
        classe_final['class'].append(1)
    elif (float(joined.values[index][2]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][3])):
        classe_final['class'].append(2)
    elif (float(joined.values[index][3]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][4])):
        classe_final['class'].append(3)
    else:
        classe_final['class'].append(4)
But as my DataFrame is quite big it's taking forever.
Do you have any idea how I could do this quicker?
# build an empty df
df = pd.DataFrame()
# get a list of the unique products, could skip this perhaps
df['Product'] = other_df['Sku'].unique()
There are 2 ways. One is to define a function and call apply:
# note: 'class' is a reserved word in Python, so the function needs a different name
def classify(x):
    if x.price < x.percentil_25:
        return 1
    elif x.price >= x.percentil_25 and x.price < x.percentil_50:
        return 2
    elif x.price >= x.percentil_50 and x.price < x.percentil_75:
        return 3
    elif x.price >= x.percentil_75:
        return 4

df['class'] = other_df.apply(lambda row: classify(row), axis=1)
Another way, which I think is better and will be much faster, is to add the 'class' column to your existing df using loc and then just take a view of the 2 columns of interest:
joined.loc[joined['price'] < joined['percentil_25'], 'class'] =1
joined.loc[(joined['price'] >= joined['percentil_25']) & (joined['price'] < joined['percentil_50']), 'class'] =2
joined.loc[(joined['price'] >= joined['percentil_50']) & (joined['price'] < joined['percentil_75']), 'class'] =3
joined.loc[joined['price'] >= joined['percentil_75'], 'class'] =4
classe_final = joined[['sku', 'class']]
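One small follow-up on the loc approach: because the 'class' column is first created with NaN for the rows the first mask doesn't match, it usually ends up as a float column; if you want integer classes you can cast it at the end:
joined['class'] = joined['class'].astype(int)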
Just for kicks you could use a load of np.where conditions:
classe_final['class'] = np.where(joined['price'] > joined['percentil_75'], 4,
                        np.where(joined['price'] > joined['percentil_50'], 3,
                        np.where(joined['price'] > joined['percentil_25'], 2, 1)))
This evaluates whether the price is greater than percentil_75: if so, the class is 4, otherwise it evaluates the next condition, and so on. It may be worth timing this compared to loc, but it is a lot less readable.
Another solution; if someone asked me to bet on which one is the fastest, I'd go for this:
joined.set_index("product").eval(
"1 * (price >= percentil_25)"
" + (price >= percentil_50)"
" + (price >= percentil_75)"
)