I'm cleaning a data set with 6 columns and just under 9k rows. As part of the clean-up I have to find zero/negative, repetitive, interpolated, and outlier values, defined as:
repetitive values - 3 subsequent values are equal up to 6 decimal places; flag the first one
interpolated values - take a = row1_val - row2_val, b = row2_val - row3_val, c = row3_val - row4_val, etc.; if a = b, or b = c, etc., flag
outlier values - MW above 1.1*peak or below 0.1*peak
Right now I am using for loops on the data frame to do the row comparisons and flag the values, put them into a new data frame, and replace them with 999999, but it takes FOREVER. I used the following code to find and replace the zero/negative values, but I can't seem to make it work for the multi-row checks used in the for loop. Can anyone show me how this works?
zero/negative values:
df = df.drop(data_columns, axis=1).join(
    df[data_columns].apply(pd.to_numeric, errors='coerce'))
Missing_Vals_df = df.loc[df['A KW'].isnull() | df['A KVAR'].isnull()
                         | df['B KW'].isnull() | df['B KVAR'].isnull()
                         | df['C KW'].isnull() | df['C KVAR'].isnull()]
df = df.fillna(999999)
Loops:
for x in range(len(df)-2):
    for thing in data_columns:
        if df.loc[x][thing] <= 0:
            df = df.replace(to_replace=df.loc[x][thing], value=999999)
        elif (round(df.loc[x][thing], 6) == round(df.loc[x+1][thing], 6) == round(df.loc[x+2][thing], 6)) and (df.loc[x][thing] != 999999):
            if x not in duplicate_loc:
                duplicate_loc.append(x)
                duplicate_df = duplicate_df.append(df.loc[x])
            df = df.replace(to_replace=df.iloc[x][thing], value=999999)
        elif (round(df.loc[x+1][thing] - df.loc[x][thing], 3) == round(df.loc[x+2][thing] - df.loc[x+1][thing], 3)) and (df.loc[x][thing] != 999999):
            if x not in interpolated_loc:
                interpolated_loc.append(x)
                interpolated_df = interpolated_df.append(df.loc[x])
            df = df.replace(to_replace=df.iloc[x][thing], value=999999)
        elif ((df.loc[x][thing] > 1.1*df_peak.loc[0]['Value']) or (df.loc[x][thing] < 0.1*df_peak.loc[0]['Value'])) and (df.loc[x][thing] != 999999):
            if x not in outlier_loc:
                outlier_loc.append(x)
                outlier_df = outlier_df.append(df.loc[x])
            df = df.replace(to_replace=df.iloc[x][thing], value=999999)
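For what it's worth, here is a minimal vectorized sketch of all four checks, assuming df, data_columns, and df_peak as defined above and reading the outlier rule as "above 1.1*peak or below 0.1*peak". It is a sketch of the idea, not a tested drop-in replacement for the loop:

peak = df_peak.loc[0, 'Value']

for col in data_columns:
    s = df[col]

    # zero/negative values
    zero_neg = s <= 0

    # repetitive: value equals the next two, to 6 decimal places (flags the first)
    r = s.round(6)
    repetitive = r.eq(r.shift(-1)) & r.eq(r.shift(-2))

    # interpolated: two consecutive forward differences are equal (to 3 places)
    d = s.diff(-1).round(3)   # d[i] = s[i] - s[i+1]
    interpolated = d.eq(d.shift(-1))

    # outliers: above 1.1*peak or below 0.1*peak
    outliers = (s > 1.1 * peak) | (s < 0.1 * peak)

    flagged = zero_neg | repetitive | interpolated | outliers
    df.loc[flagged & s.ne(999999), col] = 999999

The rows for the per-category data frames can be collected before overwriting, e.g. duplicate_df = df.loc[repetitive] inside the loop.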
Related
I have a dataframe of elements with a start and end datetime. What is the best way to find intersections of the date ranges? My naive approach right now consists of two nested loops cross-comparing the elements, which obviously is super slow. What would be a better way to achieve this?
counts = {}
start = "start_time"
end = "end_time"

for index1, rowLoop1 in df[[start, end]].head(500).iterrows():
    matches = []
    counts[(index1, rowLoop1[start])] = 0
    for index2, rowLoop2 in df[[start, end]].head(500).iterrows():
        if index1 != index2:
            if date_intersection(rowLoop1[start], rowLoop1[end], rowLoop2[start], rowLoop2[end]):
                counts[(index1, rowLoop1[start])] += 1
Code for date_intersection:
def date_intersection(t1start, t1end, t2start, t2end):
    if t1start <= t2start <= t2end <= t1end: return True
    elif t1start <= t2start <= t1end: return True
    elif t1start <= t2end <= t1end: return True
    elif t2start <= t1start <= t1end <= t2end: return True
    else: return False
Sample data:
id,start_date,end_date
41234132,2021-01-10 10:00:05,2021-01-10 10:30:27
64564512,2021-01-10 10:10:00,2021-01-11 10:28:00
21135765,2021-01-12 12:30:00,2021-01-12 12:38:00
87643252,2021-01-12 12:17:00,2021-01-12 12:42:00
87641234,2021-01-12 12:58:00,2021-01-12 13:17:00
You can do something like merging your dataframe with itself to get the cartesian product and comparing columns.
df = df.merge(df, how='cross', suffixes=('','_2'))
df['date_intersection'] = (((df['start_date'].le(df['start_date_2']) & df['start_date_2'].le(df['end_date'])) | # start 2 within start/end
(df['start_date'].le(df['end_date_2']) & df['end_date_2'].le(df['end_date'])) | # end 2 within start/end
(df['start_date_2'].le(df['start_date']) & df['start_date'].le(df['end_date_2'])) | # start within start 2/end 2
(df['start_date_2'].le(df['end_date']) & df['end_date'].le(df['end_date_2']))) & # end within start 2/end 2
df['id'].ne(df['id_2'])) # id not compared to itself
and then, to return the ids and whether they have a date intersection...
df.groupby('id')['date_intersection'].any()
id
21135765 True
41234132 True
64564512 True
87641234 False
87643252 True
or if you need the ids that were intersected
df.loc[df['date_intersection'], :].groupby(['id'])['id_2'].agg(list).to_frame('intersected_ids')
intersected_ids
id
21135765 [87643252]
41234132 [64564512]
64564512 [41234132]
87643252 [21135765]
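As an aside, the four overlap branches collapse to the standard interval test — two closed ranges intersect iff each one starts before the other ends — so the comparison column can also be built more compactly, something like:

df['date_intersection'] = (df['start_date'].le(df['end_date_2'])
                           & df['start_date_2'].le(df['end_date'])
                           & df['id'].ne(df['id_2']))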
This MRE throws TypeError: string indices must be integers:
ActualClose = [898, 1415, 226, 6006]
Predicted = [905, 1426, 229, 6021]
Prior = [891, 1351, 228, 5993]

df = pd.DataFrame(list(zip(ActualClose, Predicted, Prior)), columns=['Actual', 'Predicted', 'Prior'])

count = 0
for row in df:
    if row['Actual'] & row['Predicted'] > row['Prior'] | row['Actual'] & row['Predicted'] < row['Prior']:
        count = count + 1
I don't understand where the strings are that would be interfering with my logic. Can someone clear this up for me?
Why not just use this one-liner instead:
count = ((df['Actual'] & df['Predicted'] > df['Prior']) | df['Actual'] & (df['Predicted'] < df['Prior'])).sum()
And now:
print(count)
Would give:
2
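One caveat: & binds tighter than comparisons in Python, so the one-liner above first bitwise-ANDs the two integer columns and then compares the result to Prior. If the intent was "both Actual and Predicted above (or both below) Prior", explicit parentheses are needed; a sketch of that reading:

count = (((df['Actual'] > df['Prior']) & (df['Predicted'] > df['Prior']))
         | ((df['Actual'] < df['Prior']) & (df['Predicted'] < df['Prior']))).sum()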
I think to iterate through each row of a DataFrame you should use df.iterrows() instead of iterating over df directly — iterating over a DataFrame yields its column names, which are strings, and that is where the TypeError comes from:
ActualClose = [898, 1415, 226, 6006]
Predicted = [905, 1426, 229, 6021]
Prior = [891, 1351, 228, 5993]

df = pd.DataFrame(list(zip(ActualClose, Predicted, Prior)), columns=['Actual', 'Predicted', 'Prior'])

count = 0
for _, row in df.iterrows():
    if row['Actual'] & row['Predicted'] > row['Prior'] | row['Actual'] & row['Predicted'] < row['Prior']:
        count = count + 1
My input (as an example):
df = pd.DataFrame({'frame': [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
                   'sum_result_ICV': [0,1,1,1,2,2,2,2,1,1,1,1,1,1,1,0],
                   'sum_result_AO': [0,1,1,1,0,0,0,0,1,1,1,1,1,1,1,0]})
df['result_ICV'] = 0
df['result_AO'] = 0
My code and my_func:
for z in range(0, len(cv_details)):
    def result_func(row):
        for i in range(0, len(df)):
            if row == 2:
                return cv_details[z]
            elif row == 1:
                if df.loc[df['sum_result_' + cv_details[z]] == 2, 'frame'].empty:
                    return 'ReviewNG-' + cv_details[z]
                elif (df['frame'][i] - df.loc[df.iloc[:, z+1] == 2, 'frame'].iloc[0]) <= 3:
                    return 'NG-' + cv_details[z]
                elif (df['frame'][i] - df.loc[df.iloc[:, z+1] == 2, 'frame'].iloc[-1]) <= 3:
                    return 'NG-' + cv_details[z]
                else:
                    return 'ReviewNG-' + cv_details[z]
            elif row == 0:
                return 'Other'
            else:
                return ""
    df.iloc[:, z+3] = df.iloc[:, z+1].apply(result_func)
I expect: [image of the expected output]
But my output: [image of the actual output]
So, the condition I need is: if sum_result_ICV equals 0, put "Other". If sum_result_ICV equals 1 AND the difference between the frame number and the first/last frame where sum_result_ICV equals 2 is less than or equal to 3, put "NG-ICV"; otherwise put "ReviewNG-ICV". For example, frame 11 has sum_result_ICV equal to 1, and the last frame where sum_result_ICV equals 2 is frame 7; since 11 - 7 > 3, it gets "ReviewNG-ICV". In my example, frames 1 to 3 must be "NG-ICV", and also frames 8 to 10, but frames 11 to 14 must be "ReviewNG-ICV". Also, please see the expected-output picture above. So what am I doing wrong?
UPDATE based on the answer from @woblob:
The new code with the loop:
for z in range(0, len(cv_details)):
    # boolean masks as numpy arrays, since .iloc rejects boolean Series
    df.iloc[(df.iloc[:, z+1] == 0).to_numpy(), z+2] = 'Other'
    mask2 = (df.iloc[:, z+1] == 2).to_numpy()
    mask1 = (df.iloc[:, z+1] == 1).to_numpy()
    df.iloc[mask2, z+2] = cv_details[z]
    if df.loc[mask2, 'frame'].empty:
        df.iloc[mask1, z+2] = 'ReviewNG-' + cv_details[z]
    else:
        df_frame_first = df.loc[mask2, 'frame'].iloc[0]
        df_frame_last = df.loc[mask2, 'frame'].iloc[-1]
        mask_lt_3 = (((df.frame - df_frame_first) <= 3) | ((df.frame - df_frame_last) <= 3)).to_numpy()
        ones_lt_3 = mask1 & mask_lt_3
        ones_not_lt_3 = mask1 & (~mask_lt_3)
        df.iloc[ones_lt_3, z+2] = 'NG-' + cv_details[z]
        df.iloc[ones_not_lt_3, z+2] = 'ReviewNG-' + cv_details[z]
As I was trying to untangle the logic, I reworked it completely.
dd.loc[dd.result == 0, "sum_result"] = 'Other'
mask2 = dd.result == 2
mask1 = dd.result == 1
dd.loc[mask2, "sum_result"] = 'ICV'

if dd.loc[mask2, 'frame'].empty:
    dd.loc[mask1, "sum_result"] = 'No sum_result==2'
else:
    dd_frame_first = dd.loc[mask2, 'frame'].iloc[0]
    dd_frame_last = dd.loc[mask2, 'frame'].iloc[-1]
    mask_lt_3 = ((dd.frame - dd_frame_first) <= 3) | ((dd.frame - dd_frame_last) <= 3)
    ones_lt_3 = mask1 & mask_lt_3
    ones_not_lt_3 = mask1 & (~mask_lt_3)
    dd.loc[ones_lt_3, "sum_result"] = 'NG-ICV'
    dd.loc[ones_not_lt_3, "sum_result"] = 'ReviewNG-ICV'
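For reference, a sketch of those masks applied directly to the sample input from the question (column names adapted to the ICV case; the final comment shows the result this should produce):

import pandas as pd

df = pd.DataFrame({'frame': range(16),
                   'sum_result_ICV': [0,1,1,1,2,2,2,2,1,1,1,1,1,1,1,0]})

s = df['sum_result_ICV']
mask2 = s.eq(2)
mask1 = s.eq(1)

df.loc[s.eq(0), 'result_ICV'] = 'Other'
df.loc[mask2, 'result_ICV'] = 'ICV'

if not mask2.any():
    df.loc[mask1, 'result_ICV'] = 'ReviewNG-ICV'
else:
    first = df.loc[mask2, 'frame'].iloc[0]   # first frame with a 2 (here 4)
    last = df.loc[mask2, 'frame'].iloc[-1]   # last frame with a 2 (here 7)
    near = ((df['frame'] - first) <= 3) | ((df['frame'] - last) <= 3)
    df.loc[mask1 & near, 'result_ICV'] = 'NG-ICV'
    df.loc[mask1 & ~near, 'result_ICV'] = 'ReviewNG-ICV'

# frames 1-3 and 8-10 -> 'NG-ICV'; frames 11-14 -> 'ReviewNG-ICV'; 0 and 15 -> 'Other'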
What is the most efficient way of selecting a value from a pandas dataframe using column name and row index (by that I mean row number)?
I have a case where I have to iterate through rows, and I have a working solution:
i = 0
while i < len(dataset) - 1:
    if dataset.target[i] == 1:
        dataset.sum_lost[i] = dataset['to_be_repaid_principal'][i] + dataset['to_be_repaid_interest'][i]
        dataset.ratio_lost[i] = dataset.sum_lost[i] / dataset['expected_returned_sum'][i]
    else:
        dataset.sum_lost[i] = 0
        dataset.ratio_lost[i] = 0
    i += 1
But this solution is very RAM-hungry. I am also getting the following warning:
"A value is trying to be set on a copy of a slice from a DataFrame."
So I am trying to come up with another one:
i = 0
while i < len(dataset) - 1:
    if dataset.iloc[i, :].loc['target'] == 1:
        dataset.iloc[i, :].loc['sum_lost'] = dataset.iloc[i, :].loc['to_be_repaid_principal'] + dataset.iloc[i, :].loc['to_be_repaid_interest']
        dataset.iloc[i, :].loc['ratio_lost'] = dataset.iloc[i, :].loc['sum_lost'] / dataset.iloc[i, :].loc['expected_returned_sum']
    else:
        dataset.iloc[i, :].loc['sum_lost'] = 0
        dataset.iloc[i, :].loc['ratio_lost'] = 0
    i += 1
But it does not work.
I would like to come up with a faster/less RAM-hungry solution, because this will actually be a web app that a few users could use simultaneously.
Thanks a lot.
If you are thinking about "looping through rows", you are not using pandas right. You should think in terms of columns instead.
Use np.where which is vectorized (read: fast):
cond = dataset['target'] == 1
dataset['sum_lost'] = np.where(cond, dataset['to_be_repaid_principal'] + dataset['to_be_repaid_interest'], 0)
dataset['ratio_lost'] = np.where(cond, dataset['sum_lost'] / dataset['expected_returned_sum'], 0)
I have a pandas DataFrame named Joined with 5 fields:
product | price | percentil_25 | percentil_50 | percentil_75
For each row I want to classify the price like this:
if the price is below percentil_25, I give the product class 1, and so on.
So what I did is:
classe_final = OrderedDict()
classe_final['sku'] = []
classe_final['class'] = []

for index in range(len(joined)):
    classe_final['sku'].append(joined.values[index][0])
    if float(joined.values[index][1]) <= float(joined.values[index][2]):
        classe_final['class'].append(1)
    elif float(joined.values[index][2]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][3]):
        classe_final['class'].append(2)
    elif float(joined.values[index][3]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][4]):
        classe_final['class'].append(3)
    else:
        classe_final['class'].append(4)
But as my DataFrame is quite big it's taking forever.
Do you have any idea how I could do this quicker?
# build an empty df
df = pd.DataFrame()
# get a list of the unique products, could skip this perhaps
df['Product'] = other_df['Sku'].unique()
Two ways. First, define a func and call apply:
# 'class' is a reserved word in Python, so the function needs another name
def classify(x):
    if x.price < x.percentil_25:
        return 1
    elif x.price >= x.percentil_25 and x.price < x.percentil_50:
        return 2
    elif x.price >= x.percentil_50 and x.price < x.percentil_75:
        return 3
    elif x.price >= x.percentil_75:
        return 4

df['class'] = other_df.apply(classify, axis=1)
Another way, which I think is better and will be much faster, is to add the 'class' column to your existing df using loc and then just take a view of the 2 columns of interest:
joined.loc[joined['price'] < joined['percentil_25'], 'class'] =1
joined.loc[(joined['price'] >= joined['percentil_25']) & (joined['price'] < joined['percentil_50']), 'class'] =2
joined.loc[(joined['price'] >= joined['percentil_50']) & (joined['price'] < joined['percentil_75']), 'class'] =3
joined.loc[joined['price'] >= joined['percentil_75'], 'class'] =4
classe_final = joined[['sku', 'class']]
Just for kicks you could use a load of np.where conditions:
classe_final['class'] = np.where(joined['price'] > joined['percentil_75'], 4,
                        np.where(joined['price'] > joined['percentil_50'], 3,
                        np.where(joined['price'] > joined['percentil_25'], 2, 1)))
This evaluates whether the price is greater than percentil_75; if so, class 4, otherwise it evaluates the next condition, and so on. It may be worth timing this against the loc approach, but it is a lot less readable.
Another solution; if someone asked me to bet which one is the fastest, I'd go for this:
joined.set_index("product").eval(
    "1 + (price > percentil_25)"
    " + (price > percentil_50)"
    " + (price > percentil_75)"
)
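The same boolean-sum trick works without eval, if plain pandas is preferred (a sketch against the same joined frame):

# each comparison contributes 1 when true, so the sum lands in 1..4
classe_final = (1
                + (joined['price'] > joined['percentil_25']).astype(int)
                + (joined['price'] > joined['percentil_50']).astype(int)
                + (joined['price'] > joined['percentil_75']).astype(int))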