I'm cleaning a data set with 6 columns and just under 9k rows. As part of the clean-up I have to find zero/negative, repetitive, interpolated, and outlier values, defined as:
repetitive values - 3 subsequent values are equal up to 6 decimal places; flag the first one
interpolated values - take a = row1_val - row2_val, b = row2_val - row3_val, c = row3_val - row4_val, etc.; if a = b, or b = c, etc., flag
outlier values - MW above 1.1*peak or below 0.1*peak
Right now I am using for loops on the data frame to do the row comparisons and flag the values, put them into a new data frame, and replace them with 999999, but it takes FOREVER. I used the following code to find and replace the zero/negative values, but I can't seem to make it work for the multi-row checks used in the for loop. Can anyone show me how this works?
zero/negative values:
df = df.drop(data_columns, axis=1).join(
    df[data_columns].apply(pd.to_numeric, errors='coerce'))
Missing_Vals_df = df.loc[df['A KW'].isnull() | df['A KVAR'].isnull()
                         | df['B KW'].isnull() | df['B KVAR'].isnull()
                         | df['C KW'].isnull() | df['C KVAR'].isnull()]
df = df.fillna(999999)
Loops:
for x in range(len(df)-2):
    for thing in data_columns:
        if df.loc[x][thing] <= 0:
            df = df.replace(to_replace=df.loc[x][thing], value=999999)
        elif (round(df.loc[x][thing], 6) == round(df.loc[x+1][thing], 6) == round(df.loc[x+2][thing], 6)) and (df.loc[x][thing] != 999999):
            if x not in duplicate_loc:
                duplicate_loc.append(x)
                duplicate_df = duplicate_df.append(df.loc[x])
            df = df.replace(to_replace=df.iloc[x][thing], value=999999)
        elif (round(df.loc[x+1][thing] - df.loc[x][thing], 3) == round(df.loc[x+2][thing] - df.loc[x+1][thing], 3)) and (df.loc[x][thing] != 999999):
            if x not in interpolated_loc:
                interpolated_loc.append(x)
                interpolated_df = interpolated_df.append(df.loc[x])
            df = df.replace(to_replace=df.iloc[x][thing], value=999999)
        elif ((df.loc[x][thing] > 1.1*df_peak.loc[0]['Value']) or (df.loc[x][thing] < 0.1*df_peak.loc[0]['Value'])) and (df.loc[x][thing] != 999999):
            if x not in outlier_loc:
                outlier_loc.append(x)
                outlier_df = outlier_df.append(df.loc[x])
            df = df.replace(to_replace=df.iloc[x][thing], value=999999)
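For what it's worth, here is a minimal vectorized sketch of all four checks, assuming df, data_columns, and df_peak as defined above and reading the outlier rule as "above 1.1*peak or below 0.1*peak". It is a sketch of the idea, not a tested drop-in replacement for the loop:

peak = df_peak.loc[0, 'Value']

for col in data_columns:
    s = df[col]

    # zero/negative values
    zero_neg = s <= 0

    # repetitive: value equals the next two, to 6 decimal places (flags the first)
    r = s.round(6)
    repetitive = r.eq(r.shift(-1)) & r.eq(r.shift(-2))

    # interpolated: two consecutive forward differences are equal (to 3 places)
    d = s.diff(-1).round(3)   # d[i] = s[i] - s[i+1]
    interpolated = d.eq(d.shift(-1))

    # outliers: above 1.1*peak or below 0.1*peak
    outliers = (s > 1.1 * peak) | (s < 0.1 * peak)

    flagged = zero_neg | repetitive | interpolated | outliers
    df.loc[flagged & s.ne(999999), col] = 999999

The rows for the per-category data frames can be collected before overwriting, e.g. duplicate_df = df.loc[repetitive] inside the loop.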
Related
I have a dataframe of elements with a start and end datetime. What is the best way to find intersections of the date ranges? My naive approach right now consists of two nested loops cross-comparing the elements, which obviously is super slow. What would be a better way to achieve this?
counts = {}
start = "start_time"
end = "end_time"

for index1, rowLoop1 in df[[start, end]].head(500).iterrows():
    matches = []
    counts[(index1, rowLoop1[start])] = 0
    for index2, rowLoop2 in df[[start, end]].head(500).iterrows():
        if index1 != index2:
            if date_intersection(rowLoop1[start], rowLoop1[end], rowLoop2[start], rowLoop2[end]):
                counts[(index1, rowLoop1[start])] += 1
Code for date_intersection:
def date_intersection(t1start, t1end, t2start, t2end):
    if t1start <= t2start <= t2end <= t1end: return True
    elif t1start <= t2start <= t1end: return True
    elif t1start <= t2end <= t1end: return True
    elif t2start <= t1start <= t1end <= t2end: return True
    else: return False
Sample data:
id,start_date,end_date
41234132,2021-01-10 10:00:05,2021-01-10 10:30:27
64564512,2021-01-10 10:10:00,2021-01-11 10:28:00
21135765,2021-01-12 12:30:00,2021-01-12 12:38:00
87643252,2021-01-12 12:17:00,2021-01-12 12:42:00
87641234,2021-01-12 12:58:00,2021-01-12 13:17:00
You can do something like merging your dataframe with itself to get the cartesian product and comparing columns.
df = df.merge(df, how='cross', suffixes=('','_2'))
df['date_intersection'] = (((df['start_date'].le(df['start_date_2']) & df['start_date_2'].le(df['end_date'])) | # start 2 within start/end
(df['start_date'].le(df['end_date_2']) & df['end_date_2'].le(df['end_date'])) | # end 2 within start/end
(df['start_date_2'].le(df['start_date']) & df['start_date'].le(df['end_date_2'])) | # start within start 2/end 2
(df['start_date_2'].le(df['end_date']) & df['end_date'].le(df['end_date_2']))) & # end within start 2/end 2
df['id'].ne(df['id_2'])) # id not compared to itself
and then, to return the ids and whether they have a date intersection...
df.groupby('id')['date_intersection'].any()
id
21135765 True
41234132 True
64564512 True
87641234 False
87643252 True
or if you need the ids that were intersected
df.loc[df['date_intersection'], :].groupby(['id'])['id_2'].agg(list).to_frame('intersected_ids')
intersected_ids
id
21135765 [87643252]
41234132 [64564512]
64564512 [41234132]
87643252 [21135765]
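As an aside, the four overlap branches collapse to the standard interval test — two closed ranges intersect iff each one starts before the other ends — so the comparison column can also be built more compactly, something like:

df['date_intersection'] = (df['start_date'].le(df['end_date_2'])
                           & df['start_date_2'].le(df['end_date'])
                           & df['id'].ne(df['id_2']))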
This MRE throws TypeError: string indices must be integers:
ActualClose = [898, 1415, 226, 6006]
Predicted = [905, 1426, 229, 6021]
Prior = [891, 1351, 228, 5993]

df = pd.DataFrame(list(zip(ActualClose, Predicted, Prior)), columns=['Actual', 'Predicted', 'Prior'])

count = 0
for row in df:
    if row['Actual'] & row['Predicted'] > row['Prior'] | row['Actual'] & row['Predicted'] < row['Prior']:
        count = count + 1
I don't understand where the strings are that would be interfering with my logic. Can someone clear this up for me?
Why not just use this one-liner instead:
count = ((df['Actual'] & df['Predicted'] > df['Prior']) | df['Actual'] & (df['Predicted'] < df['Prior'])).sum()
And now:
print(count)
Would give:
2
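One caveat: & binds tighter than comparisons in Python, so the one-liner above first bitwise-ANDs the two integer columns and then compares the result to Prior. If the intent was "both Actual and Predicted above (or both below) Prior", explicit parentheses are needed; a sketch of that reading:

count = (((df['Actual'] > df['Prior']) & (df['Predicted'] > df['Prior']))
         | ((df['Actual'] < df['Prior']) & (df['Predicted'] < df['Prior']))).sum()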
I think to iterate through each row of a DataFrame you should use df.iterrows() instead of iterating over df directly — iterating over a DataFrame yields its column names, which are strings, and that is where the TypeError comes from:
ActualClose = [898, 1415, 226, 6006]
Predicted = [905, 1426, 229, 6021]
Prior = [891, 1351, 228, 5993]

df = pd.DataFrame(list(zip(ActualClose, Predicted, Prior)), columns=['Actual', 'Predicted', 'Prior'])

count = 0
for _, row in df.iterrows():
    if row['Actual'] & row['Predicted'] > row['Prior'] | row['Actual'] & row['Predicted'] < row['Prior']:
        count = count + 1
My input (as an example):
df = pd.DataFrame({'frame': [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
                   'sum_result_ICV': [0,1,1,1,2,2,2,2,1,1,1,1,1,1,1,0],
                   'sum_result_AO': [0,1,1,1,0,0,0,0,1,1,1,1,1,1,1,0]})
df['result_ICV'] = 0
df['result_AO'] = 0
My code and my_func:
for z in range(0, len(cv_details)):
    def result_func(row):
        for i in range(0, len(df)):
            if row == 2:
                return cv_details[z]
            elif row == 1:
                if df.loc[df['sum_result_' + cv_details[z]] == 2, 'frame'].empty:
                    return 'ReviewNG-' + cv_details[z]
                elif (df['frame'][i] - df.loc[df.iloc[:, z+1] == 2, 'frame'].iloc[0]) <= 3:
                    return 'NG-' + cv_details[z]
                elif (df['frame'][i] - df.loc[df.iloc[:, z+1] == 2, 'frame'].iloc[-1]) <= 3:
                    return 'NG-' + cv_details[z]
                else:
                    return 'ReviewNG-' + cv_details[z]
            elif row == 0:
                return 'Other'
            else:
                return ""
    df.iloc[:, z+3] = df.iloc[:, z+1].apply(result_func)
I expect: [image of the expected output]
But my output: [image of the actual output]
So, the condition I need is: if sum_result_ICV equals 0, put "Other". If sum_result_ICV equals 1 AND the difference between the frame number and the first/last frame where sum_result_ICV equals 2 is less than or equal to 3, put "NG-ICV"; otherwise put "ReviewNG-ICV". For example, frame 11 has sum_result_ICV equal to 1, and the last frame where sum_result_ICV equals 2 is frame 7; since 11 - 7 > 3, it gets "ReviewNG-ICV". In my example, frames 1 to 3 must be "NG-ICV", and also frames 8 to 10, but frames 11 to 14 must be "ReviewNG-ICV". Also, please see the expected-output picture above. So what am I doing wrong?
UPDATE based on the answer from @woblob:
The new code with the loop:
for z in range(0, len(cv_details)):
    # boolean masks as numpy arrays, since .iloc rejects boolean Series
    df.iloc[(df.iloc[:, z+1] == 0).to_numpy(), z+2] = 'Other'
    mask2 = (df.iloc[:, z+1] == 2).to_numpy()
    mask1 = (df.iloc[:, z+1] == 1).to_numpy()
    df.iloc[mask2, z+2] = cv_details[z]
    if df.loc[mask2, 'frame'].empty:
        df.iloc[mask1, z+2] = 'ReviewNG-' + cv_details[z]
    else:
        df_frame_first = df.loc[mask2, 'frame'].iloc[0]
        df_frame_last = df.loc[mask2, 'frame'].iloc[-1]
        mask_lt_3 = (((df.frame - df_frame_first) <= 3) | ((df.frame - df_frame_last) <= 3)).to_numpy()
        ones_lt_3 = mask1 & mask_lt_3
        ones_not_lt_3 = mask1 & (~mask_lt_3)
        df.iloc[ones_lt_3, z+2] = 'NG-' + cv_details[z]
        df.iloc[ones_not_lt_3, z+2] = 'ReviewNG-' + cv_details[z]
As I was trying to untangle the logic, I reworked it completely.
dd.loc[dd.result == 0, "sum_result"] = 'Other'
mask2 = dd.result == 2
mask1 = dd.result == 1
dd.loc[mask2, "sum_result"] = 'ICV'

if dd.loc[mask2, 'frame'].empty:
    dd.loc[mask1, "sum_result"] = 'No sum_result==2'
else:
    dd_frame_first = dd.loc[mask2, 'frame'].iloc[0]
    dd_frame_last = dd.loc[mask2, 'frame'].iloc[-1]
    mask_lt_3 = ((dd.frame - dd_frame_first) <= 3) | ((dd.frame - dd_frame_last) <= 3)
    ones_lt_3 = mask1 & mask_lt_3
    ones_not_lt_3 = mask1 & (~mask_lt_3)
    dd.loc[ones_lt_3, "sum_result"] = 'NG-ICV'
    dd.loc[ones_not_lt_3, "sum_result"] = 'ReviewNG-ICV'
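For reference, a sketch of those masks applied directly to the sample input from the question (column names adapted to the ICV case; the final comment shows the result this should produce):

import pandas as pd

df = pd.DataFrame({'frame': range(16),
                   'sum_result_ICV': [0,1,1,1,2,2,2,2,1,1,1,1,1,1,1,0]})

s = df['sum_result_ICV']
mask2 = s.eq(2)
mask1 = s.eq(1)

df.loc[s.eq(0), 'result_ICV'] = 'Other'
df.loc[mask2, 'result_ICV'] = 'ICV'

if not mask2.any():
    df.loc[mask1, 'result_ICV'] = 'ReviewNG-ICV'
else:
    first = df.loc[mask2, 'frame'].iloc[0]   # first frame with a 2 (here 4)
    last = df.loc[mask2, 'frame'].iloc[-1]   # last frame with a 2 (here 7)
    near = ((df['frame'] - first) <= 3) | ((df['frame'] - last) <= 3)
    df.loc[mask1 & near, 'result_ICV'] = 'NG-ICV'
    df.loc[mask1 & ~near, 'result_ICV'] = 'ReviewNG-ICV'

# frames 1-3 and 8-10 -> 'NG-ICV'; frames 11-14 -> 'ReviewNG-ICV'; 0 and 15 -> 'Other'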
What is the most efficient way of selecting a value from a pandas dataframe using column name and row index (by that I mean row number)?
I have a case where I have to iterate through rows, and I have a working solution:
i = 0
while i < len(dataset) - 1:
    if dataset.target[i] == 1:
        dataset.sum_lost[i] = dataset['to_be_repaid_principal'][i] + dataset['to_be_repaid_interest'][i]
        dataset.ratio_lost[i] = dataset.sum_lost[i] / dataset['expected_returned_sum'][i]
    else:
        dataset.sum_lost[i] = 0
        dataset.ratio_lost[i] = 0
    i += 1
But this solution is very RAM-hungry. I am also getting the following warning:
"A value is trying to be set on a copy of a slice from a DataFrame."
So I am trying to come up with another one:
i = 0
while i < len(dataset) - 1:
    if dataset.iloc[i, :].loc['target'] == 1:
        dataset.iloc[i, :].loc['sum_lost'] = dataset.iloc[i, :].loc['to_be_repaid_principal'] + dataset.iloc[i, :].loc['to_be_repaid_interest']
        dataset.iloc[i, :].loc['ratio_lost'] = dataset.iloc[i, :].loc['sum_lost'] / dataset.iloc[i, :].loc['expected_returned_sum']
    else:
        dataset.iloc[i, :].loc['sum_lost'] = 0
        dataset.iloc[i, :].loc['ratio_lost'] = 0
    i += 1
But it does not work.
I would like to come up with a faster/less RAM-hungry solution, because this will actually be a web app that a few users could use simultaneously.
Thanks a lot.
If you are thinking about "looping through rows", you are not using pandas right. You should think in terms of columns instead.
Use np.where which is vectorized (read: fast):
cond = dataset['target'] == 1
dataset['sum_lost'] = np.where(cond, dataset['to_be_repaid_principal'] + dataset['to_be_repaid_interest'], 0)
dataset['ratio_lost'] = np.where(cond, dataset['sum_lost'] / dataset['expected_returned_sum'], 0)
I have a pandas DataFrame named Joined with 5 fields:
product | price | percentil_25 | percentil_50 | percentil_75
For each row I want to classify the price like this:
if the price is below percentil_25, I give the product class 1, and so on.
So what I did is:
classe_final = OrderedDict()
classe_final['sku'] = []
classe_final['class'] = []

for index in range(len(joined)):
    classe_final['sku'].append(joined.values[index][0])
    if float(joined.values[index][1]) <= float(joined.values[index][2]):
        classe_final['class'].append(1)
    elif float(joined.values[index][2]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][3]):
        classe_final['class'].append(2)
    elif float(joined.values[index][3]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][4]):
        classe_final['class'].append(3)
    else:
        classe_final['class'].append(4)
But as my DataFrame is quite big it's taking forever.
Do you have any idea how I could do this quicker?
# build an empty df
df = pd.DataFrame()
# get a list of the unique products, could skip this perhaps
df['Product'] = other_df['Sku'].unique()
Two ways. First, define a func and call apply:
# 'class' is a reserved word in Python, so the function needs another name
def classify(x):
    if x.price < x.percentil_25:
        return 1
    elif x.price >= x.percentil_25 and x.price < x.percentil_50:
        return 2
    elif x.price >= x.percentil_50 and x.price < x.percentil_75:
        return 3
    elif x.price >= x.percentil_75:
        return 4

df['class'] = other_df.apply(classify, axis=1)
Another way, which I think is better and will be much faster, is to add the 'class' column to your existing df using loc and then just take a view of the 2 columns of interest:
joined.loc[joined['price'] < joined['percentil_25'], 'class'] =1
joined.loc[(joined['price'] >= joined['percentil_25']) & (joined['price'] < joined['percentil_50']), 'class'] =2
joined.loc[(joined['price'] >= joined['percentil_50']) & (joined['price'] < joined['percentil_75']), 'class'] =3
joined.loc[joined['price'] >= joined['percentil_75'], 'class'] =4
classe_final = joined[['sku', 'class']]
Just for kicks you could use a load of np.where conditions:
classe_final['class'] = np.where(joined['price'] > joined['percentil_75'], 4,
                        np.where(joined['price'] > joined['percentil_50'], 3,
                        np.where(joined['price'] > joined['percentil_25'], 2, 1)))
This evaluates whether the price is greater than percentil_75; if so, class 4, otherwise it evaluates the next condition, and so on. It may be worth timing this against the loc approach, but it is a lot less readable.
Another solution; if someone asked me to bet which one is the fastest, I'd go for this:
joined.set_index("product").eval(
    "1 + (price > percentil_25)"
    " + (price > percentil_50)"
    " + (price > percentil_75)"
)
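The same boolean-sum trick works without eval, if plain pandas is preferred (a sketch against the same joined frame):

# each comparison contributes 1 when true, so the sum lands in 1..4
classe_final = (1
                + (joined['price'] > joined['percentil_25']).astype(int)
                + (joined['price'] > joined['percentil_50']).astype(int)
                + (joined['price'] > joined['percentil_75']).astype(int))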