Vectorized solution for pandas nested iterrows - python

Given an example dataframe:
example_df = pd.DataFrame({
    "app_id": [1, 2, 3, 4, 5, 6],
    "payment_date": ["2021-01-01", "2021-02-01", "2020-03-02", "2020-04-05", "2020-01-05", "2020-01-04"],
    "user_id": [12, 12, 12, 13, 13, 13],
    "application_date": ["2021-02-01", "2021-02-01", "2020-03-02", "2020-04-05", "2020-01-05", "2020-01-04"],
    "flag": [1, 0, 0, 1, 0, 1],
    "order_column": [1, 2, 3, 4, 5, 6]})
What I want to do, explained with an example:
Iterate through all rows.
If the flag column is equal to 1, do as stated below.
For the first row, the flag column is 1 and the user_id is 12. Look at all instances with user_id = 12 and compare their application_date with the payment_date of the first row. We see that the second row has an application_date greater than the payment_date of the first row, so the label of the first row is 1. The third row also belongs to user_id = 12, but its application_date is not greater than the payment_date of the first row. If one or more observations have an application_date greater than the payment_date of the first row, the overall label of the first row is 1; if there are no such observations, the overall label is 0.
I wrote the code for this with iterrows, but I want a more compact vectorized solution, since iterrows can be slow on larger datasets. Something like:
example_df.groupby("something").filter(lambda row: row. ...)
My code is:
labels_dict = {}
for idx, row in example_df.iterrows():
    if row.flag == 1:
        app_id = row.app_id
        user_id = row.user_id
        user_df = example_df[example_df.user_id == user_id]
        labelss = []
        for idx2, row2 in user_df.iterrows():
            if (row2.order_column != row.order_column) & (row.payment_date < row2.application_date):
                label = 1
                labelss.append(label)
            elif (row2.order_column != row.order_column) & (row.payment_date >= row2.application_date):
                label = 0
                labelss.append(label)
        labels_dict[app_id] = labelss

final_labels = {}
for key, value in labels_dict.items():
    if 1 in value:
        final_labels[key] = 1
    else:
        final_labels[key] = 0
final_labels is the expected output. Basically, I am asking for all rows with flag = 1 to be labelled 1 or 0 according to the criteria I explained.
Desired output :
{1: 1, 4: 0, 6: 1}
Here keys are app_id and values are labels (either 0 or 1)

I would first build a temp dataframe with only the rows having 1 in flag, and merge it with the full dataframe on user_id.
Then I would add a new boolean column that is true if application_date is greater than payment_date and the original app_id is different from the one from temp (i.e. different rows).
Finally it is enough to count the number of true values per app_id and give a 1 if the count is greater than 0.
Pandas code could be:
tmp = example_df.loc[example_df['flag'] == 1,
                     ['app_id', 'user_id', 'payment_date']]
tmp = tmp.merge(example_df.drop(columns='payment_date'), on='user_id')
tmp['k'] = ((tmp['app_id_x'] != tmp['app_id_y'])
            & (tmp['application_date'] > tmp['payment_date']))
d = (tmp.groupby('app_id_x')['k'].sum() != 0).astype('int').to_dict()
With your data, it gives as expected:
{1: 1, 4: 0, 6: 1}
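Note that the comparison above is done on the raw strings; it works here only because the dates are ISO-formatted (YYYY-MM-DD), where lexicographic order matches chronological order. For other date formats you would want to convert first; a minimal sketch:

import pandas as pd

# parse both date columns into real datetimes before comparing (safe for any format)
example_df['payment_date'] = pd.to_datetime(example_df['payment_date'])
example_df['application_date'] = pd.to_datetime(example_df['application_date'])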

(i) Convert all dates to datetime objects.
(ii) Build a helper column flag_cumsum (the cumulative sum of flag) so that each flag = 1 row starts a new block. Then groupby ['flag_cumsum', 'user_id'] and for each group find the first "payment_date" using first, transform it across the group, and compare it with the "application_date"s using lt (less than).
(iii) groupby "flag_cumsum" again to find how many entries satisfy the condition and assign values depending on whether the sum is greater than 0 or not.
example_df['payment_date'] = pd.to_datetime(example_df['payment_date'])
example_df['application_date'] = pd.to_datetime(example_df['application_date'])
example_df['flag_cumsum'] = example_df['flag'].cumsum()
example_df['first_payment_date < application_date'] = (example_df
    .groupby(['flag_cumsum', 'user_id'])['payment_date']
    .transform('first')
    .lt(example_df['application_date']))
out = (example_df.groupby('flag_cumsum')
    .agg({'app_id': 'first',
          'first_payment_date < application_date': 'sum'})
    .set_index('app_id')['first_payment_date < application_date']
    .gt(0).astype(int)
    .to_dict())
Output:
{1: 1, 4: 0, 6: 0}
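To see how the helper column carves the frame into blocks, this is what the cumulative sum looks like on the example data (a quick illustration; each flag = 1 row opens a new group):

example_df['flag'].cumsum()
# 0    1
# 1    1
# 2    1
# 3    2
# 4    2
# 5    3
# Name: flag, dtype: int64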

Related

I'm getting a different output than expected when using df.loc to change some values of the df

I have a data frame, and I want to assign a quartile number based on the quartile variable, which gives me the ranges that I later use in the for loop. The problem is that instead of just changing the quartile number, it is creating n (the length of the dataframe) rows, and then using the row number for the loop.
(Screenshots of the expected result and the actual output were attached to the original question.)
quartile = numpy.quantile(pivot['AHT'], [0.25, 0.5, 0.75])
pivot['Quartile'] = 0
for i in range(0, len(pivot) - 1):
    if i <= quartile[0]:
        pivot.loc[i, 'Quartile'] = 1
    elif i <= quartile[1]:
        pivot.loc[i, 'Quartile'] = 2
    elif i <= quartile[2]:
        pivot.loc[i, 'Quartile'] = 3
    else:
        pivot.loc[i, 'Quartile'] = 4
Use qcut with labels=False and add 1, or specify the values of labels in a list:
pivot['Quartile'] = pd.qcut(pivot['AHT'], 4, labels=False) + 1
pivot['Quartile'] = pd.qcut(pivot['AHT'], 4, labels=[1,2,3,4])
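A quick illustration on made-up numbers (the pivot frame below is a hypothetical stand-in, not the asker's data):

import pandas as pd

# hypothetical stand-in for the asker's pivot table
pivot = pd.DataFrame({'AHT': [120, 95, 210, 180, 60, 150, 300, 75]})
pivot['Quartile'] = pd.qcut(pivot['AHT'], 4, labels=False) + 1
print(pivot.sort_values('AHT'))  # each row now carries its quartile number 1-4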

Edit entire column in dataframe based on value in last row of that column

I have a dataframe, and I want to set the entire value of the column to 0 if the value in the last row of that column is <0.
The current code i have is:
ccy.loc[(ccy['CumProfit'].iloc[-1] < 0), 'CumProfit'] = 0
Use if:
if ccy['CumProfit'].iloc[-1] < 0:
    ccy['CumProfit'] = 0
Or if-else for a one-line solution:
ccy['CumProfit'] = 0 if ccy['CumProfit'].iloc[-1] < 0 else ccy['CumProfit']
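For illustration, a tiny self-contained run (the numbers are invented):

import pandas as pd

ccy = pd.DataFrame({'CumProfit': [1.0, 2.5, -0.5]})  # made-up values; the last row is negative
if ccy['CumProfit'].iloc[-1] < 0:
    ccy['CumProfit'] = 0
print(ccy)  # the whole CumProfit column is now 0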

Add a value in a column as a function of the timestamp and another column

The title may not be very clear, but I hope an example will make some sense.
I would like to create an output column (called "outputTics") and put a 1 in it 0.21 seconds after a 1 appears in the "inputTics" column.
Since no value falls exactly 0.21 seconds after another value, I'll put the 1 in the outputTics column two rows later: for example, at index 3 there is a 1 at 11.4 seconds, so I'm putting a 1 in the output column at 11.6 seconds.
If there is a 1 in the "inputTics" column within 0.21 seconds of an earlier one, do not put a 1 in the output column: an example would be index 1 in the input column.
(The desired column is shown as outputTics in the dataframe below.)
Here is the code to create the dataframe :
A = pd.DataFrame({
    "Timestamp": [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0,
                  12.1, 12.2, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8, 12.9, 13.0],
    "inputTics":  [0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1],
    "outputTics": [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]})
You can use pd.Timedelta (after converting Timestamp to a timedelta dtype) if you want to avoid Python's floating-point rounding.
Create the column with zeros:
df['outputTics'] = 0
Define a function set_output_tic in the following manner:
def set_output_tic(row):
    if row['inputTics'] == 0:
        return 0
    # check for a 1 in the input within 0.11 seconds after this row
    t = row['Timestamp'] + 0.11
    indices = df[(df.Timestamp > row['Timestamp']) & (df.Timestamp <= t)].index
    c = 0
    for i in indices:
        if df.loc[i, 'inputTics'] == 0:
            c = c + 1
        else:
            c = 0
            break
    if c > 0:
        df.loc[indices[-1] + 1, 'outputTics'] = 1
    return 0
Then call the above function using df.apply:
temp = df.apply(set_output_tic, axis=1)  # temp is practically useless
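If you do want the exact arithmetic mentioned above, one option (my assumption, not part of the original answer) is to convert the Timestamp column to a timedelta dtype first, so that pd.Timedelta can be added directly:

# convert float seconds to a timedelta dtype so 0.11 s can be added exactly
df['Timestamp'] = pd.to_timedelta(df['Timestamp'], unit='s')
# inside set_output_tic the window end would then become:
# t = row['Timestamp'] + pd.Timedelta(seconds=0.11)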
This was actually kinda tricky, but by playing with indices in numpy you can do it.
# Set timestamp as index for a moment
A = A.set_index(['Timestamp'])
# Find the timestamp indices of inputTics and add your 0.11
input_indices = A[A['inputTics'] == 1].index + 0.11
# Iterate through the indices and find the indices to update outputTics
output_indices = []
for ii in input_indices:
    # Compare indices to the full dataframe's timestamps
    # and return the index of the nearest timestamp
    oi = np.argmax((A.index - ii) >= 0)
    output_indices.append(oi)
# Create a column of output tics with 1s in the right places
output_tics = np.zeros(len(A))
output_tics[output_indices] = 1
# Add it to the dataframe
A['outputTics'] = output_tics
# Add the condition that if inputTics is 1, outputTics is 0
A['outputTics'] = A['outputTics'] - A['inputTics']
# Clean up negative values
A.loc[A['outputTics'] < 0, 'outputTics'] = 0
# The first row becomes 1 because of indexing; change it to 0
A = A.reset_index()
A.at[0, 'outputTics'] = 0
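The per-timestamp loop can also be collapsed into one searchsorted call while Timestamp is still the index; it returns the position of the first timestamp >= each shifted input time (a sketch of the same idea, not the answerer's code):

# vectorized replacement for the argmax loop
output_indices = A.index.searchsorted(input_indices)
# drop shifted times that fall past the last row
output_indices = output_indices[output_indices < len(A)]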

How do I label data based on the values of the previous row?

I want to label the data "1" if the current value is higher than that of the previous row and "0" otherwise.
Lets say I have this DataFrame:
df = pd.DataFrame({'date': [1,2,3,4,5], 'price': [50.125, 45.25, 65.857, 100.956, 77.4152]})
and I want the output as if the DataFrame is constructed like this:
df = pd.DataFrame({'date': [1,2,3,4,5], 'price': [50.125, 45.25, 65.857, 100.956, 77.4152], 'label':[0, 0, 1, 1, 0]})
*I don't know how to post a DataFrame
This code is my attempt:
df['label'] = 0
i = 0
for price in df['price']:
    i = i + 1
    if price in i > price:  # ---> right now I am stuck here. It says argument of type 'int' is not iterable
        df.append['label', 1]
    elif price in i <= price:
        df.append['label', 0]
I think there are also other logical mistakes in my codes. What am I doing wrong?
Create a boolean mask with Series.ge (>=) and Series.shift, then convert the True/False values to 1/0 integers by Series.view:
df['label'] = df['price'].ge(df['price'].shift()).view('i1')
Or by Series.astype:
df['label'] = df['price'].ge(df['price'].shift()).astype(int)
IIUC, use np.where with a shifted boolean comparison to see whether each row's price is greater than (or equal to) the row above.
df['label'] = np.where(df['price'].ge(df['price'].shift()),1,0)
print(df)
date price label
0 1 50.1250 0
1 2 45.2500 0
2 3 65.8570 1
3 4 100.9560 1
4 5 77.4152 0
Explanation:
print(df['price'].ge(df['price'].shift()))
returns a boolean Series of True and False values that we can use in our np.where clause
0 False
1 False
2 True
3 True
4 False
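For what it's worth, a Series.diff comparison produces the same labels in one step (an equivalent alternative, not from the answers above):

# diff() is NaN for the first row; NaN >= 0 is False, so the first label is 0
df['label'] = (df['price'].diff() >= 0).astype(int)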
To explain what is happening in your code:
A new column can't be grown by appending to df['label'] row by row; it is easier to collect the labels in a plain Python list and assign the whole list to the DataFrame at the end.
i is just the index value (0, 1, 2, 3...) and not the value of the price at a specific index (50.125, 45.25, 65.857...), so it is not what you want to compare.
price in checks whether the value of the price variable exists in a collection that follows. The in statement isn't followed by a collection here, so you get an error. What you want instead is to get the price value at a specific index and compare whether it is greater or less than the value at the previous index.
The append method uses () and not [].
If you want to continue along your method of using a loop, the following can work:
labels = [0]  # the first row has no previous row, so it gets 0
for i in range(1, len(df['price'])):
    if df['price'][i] > df['price'][i - 1]:
        labels.append(1)
    else:
        labels.append(0)
df['label'] = labels
What this does is loop through the range of the length of the price list and compare the values of the price at position i and position i - 1.
You can also further simplify the if/else statement using a ternary operator:
labels.append(1 if df['price'][i] > df['price'][i - 1] else 0)
Working fiddle: https://repl.it/repls/HoarseImpressiveScientists

Counting number of null values below and placing them in new df

I'm attempting to count the number of null values below each non-null cell in a dataframe and put the number into a new variable (size) and data frame.
I have included a picture of the dataframe I'm trying to count. I'm only interested in the Arrival Date Column for now. The new data frame should have a column that has 1,1,3,7..etc as it's first observations.
## Loops through all of the rows in DOAs
for i in range(0, DOAs.shape[0]):
    j = 0
    if DOAs.iloc[int(i), 3] != None:  # the rest only runs if the current, i, observation isn't null
        newDOAs.iloc[int(j), 0] = DOAs.iloc[int(i), 3]  # sets the jth row in the new dataframe to the ith (currently assessed) row of the old
        foundNull = True  # sets foundNull equal to true
        k = 1  # sets the counter of people
        while foundNull == True and (k + i) < 677:
            if DOAs.iloc[int(i + k), 3] == None:  # if the next one it looks at is null, increment the counter to add another person to the family
                k = k + 1
            else:
                newDOAs.iloc[int(j), 1] = k  # sets the second column in the new dataframe equal to the size
                j = j + 1
                foundNull = False
        j = 0
What you can do is get the indices of non-null entries in whatever column of your dataframe, then get the distances between each. Note: This is assuming they are nicely ordered and/or you don't mind calling .reset_index() on your dataframe.
Here is a sample:
df = pd.DataFrame({'a': [1, None, None, None, 2, None, None, 3, None, None]})
not_null_index = df.dropna(subset=['a']).index
null_counts = {}
for i in range(len(not_null_index)):
    if i < len(not_null_index) - 1:
        null_counts[not_null_index[i]] = not_null_index[i + 1] - 1 - not_null_index[i]
    else:
        null_counts[not_null_index[i]] = len(df.a) - 1 - not_null_index[i]
null_counts_df = pd.DataFrame({'nulls': list(null_counts.values())}, index=null_counts.keys())
df_with_null_counts = pd.merge(df, null_counts_df, left_index=True, right_index=True)
Essentially all this code does is get the indices of non-null values in the dataframe, then gets the difference between each index and the next non-null index and puts that in the column. Then sticks those null_counts in a dataframe and merges it with the original.
After running this snippet, df_with_null_counts is equal to:
     a  nulls
0  1.0      3
4  2.0      2
7  3.0      2
Alternatively, you can use numpy instead of using a loop, which would be much faster for large dataframes. Here's a sample:
df = pd.DataFrame({'a': [1, None, None, None, 2, None, None, 3, None, None]})
not_null_index = df.dropna(subset=['a']).index
offset_index = np.array([*not_null_index[1:], len(df.a)])
null_counts = offset_index - np.array(not_null_index) - 1
null_counts_df = pd.DataFrame({'nulls': null_counts}, index=not_null_index)
df_with_null_counts = pd.merge(df, null_counts_df, left_index=True, right_index=True)
And the output will be the same.
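A pure-pandas variant (my own sketch, not from the answer) does the same with a cumulative sum of the non-null mask: each non-null cell opens a group, and summing the null mask within each group counts the nulls below it:

import pandas as pd

df = pd.DataFrame({'a': [1, None, None, None, 2, None, None, 3, None, None]})
groups = df['a'].notna().cumsum()                    # each non-null value starts a new group
null_counts = df['a'].isna().groupby(groups).sum()   # count the nulls inside each group
null_counts.index = df.dropna(subset=['a']).index    # re-attach counts to the non-null rows
print(null_counts)  # 0 -> 3, 4 -> 2, 7 -> 2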
