Test for First Occurrence of Conditions in Python Dataframe - python

Background
Pretty new to Python and dataframes. I'm on a Mac (Sierra) running Jupyter Notebook in Firefox (87.0). I've got a dataframe like this:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'SubGroup': [1, 5, 6, 5, 8, 6, 8, 6, 6, 5],
                   'Price': [7, 1, 0, 10, 2, 3, 0, 0, 10, 0]})
A SubGroup Price
0 1 1 7
1 2 5 1
2 3 6 0
3 4 5 10
4 5 8 2
5 6 6 3
6 7 8 0
7 8 6 0
8 9 6 10
9 10 5 0
I want to add a Boolean column to this dataframe that checks whether (a) the price in this row is zero and (b) it's the first occurrence of a zero price for this subgroup (reading from top to bottom). If both (a) and (b) are true, then return True, otherwise False. So it should look like this:
A SubGroup Price Test
0 1 1 7 False
1 2 5 1 False
2 3 6 0 True
3 4 5 10 False
4 5 8 2 False
5 6 6 3 False
6 7 8 0 True
7 8 6 0 False
8 9 6 10 False
9 10 5 0 True
What I've Tried
The first condition (Price == 0) is easy. Checking whether it's the first occurrence for the subgroup is where I could use some help. I have an Excel background, so I started by thinking about how to solve this using a MINIFS function. The idea was to find the minimum Price for the Subgroup, looking only at the rows above the current row. If that min was greater than zero, then I'd know this was the first zero occurrence. The closest I could find (from this post) was a line like...
df['subgroupGlobalMin'] = df.groupby('SubGroup')['Price'].transform('min')
...which works but takes a global minimum across all rows for the Subgroup, not just the ones above the current row. So I tried to specify the target range for my min using iloc, like this...
df['subgroupPreviousMin'] = df.iloc[:df.index].groupby('SubGroup')['Price'].transform('min')
...but this produces the error "cannot do positional indexing on RangeIndex with these indexers [RangeIndex(start=0, stop=10, step=1)] of type RangeIndex". I couldn't figure out how to dynamically specify my rows/indices.
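As an aside, the "minimum of the rows above" idea can be sketched without positional slicing at all: take a per-subgroup expanding minimum and shift it down one row, so each row only sees the prices above it within its SubGroup (an illustrative sketch, not one of the original attempts):
df['subgroupPreviousMin'] = (
    df.groupby('SubGroup')['Price']
      .transform(lambda s: s.shift().expanding().min())
)
Here NaN marks the first row of a subgroup, and a value greater than zero means no zero has appeared yet for that subgroup.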
So I changed strategies and instead tried to find the index of the first occurrence of the minimum value for a subgroup using idxmin (like this post):
df['minIndex'] = df.groupby(['SubGroup'])[['Price']].idxmin()
The plan was to check this against the current row index with df.index, but I get unexpected output here:
A SubGroup Price minIndex
0 1 1 7 NaN
1 2 5 1 0.0
2 3 6 0 NaN
3 4 5 10 NaN
4 5 8 2 NaN
5 6 6 3 9.0
6 7 8 0 2.0
7 8 6 0 NaN
8 9 6 10 6.0
9 10 5 0 NaN
I know what it's doing here, but I don't know why or how to fix it.
Questions
Which strategy is best for what I'm trying to achieve - using a min function, checking the index with something like idxmin, or something else?
How should I add a column to my dataframe that checks if the price is 0 for that row and if it's the first occurrence of a zero for that subgroup?

Let us try your logic:
is_zero = df.Price.eq(0)                                         # condition (a): price is zero
is_first_zero = is_zero.groupby(df['SubGroup']).cumsum().eq(1)   # running zero-count within the SubGroup is exactly 1
df['Test'] = is_zero & is_first_zero
Output:
A SubGroup Price Test
0 1 1 7 False
1 2 5 1 False
2 3 6 0 True
3 4 5 10 False
4 5 8 2 False
5 6 6 3 False
6 7 8 0 True
7 8 6 0 False
8 9 6 10 False
9 10 5 0 True
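Why this works: within each SubGroup, the cumulative sum of the zero mask counts how many zeros have been seen so far, so it equals 1 from the first zero up to (but not including) the second zero; combining it with is_zero keeps only the first zero itself. The idxmin strategy from the question can also be completed, as a rough sketch, by broadcasting the per-group result back onto every row with map (idxmin returns the label of the first occurrence of each group's minimum, and the extra Price check guards groups whose minimum is not zero):
first_min = df['SubGroup'].map(df.groupby('SubGroup')['Price'].idxmin())   # row label of each group's first minimum price
df['Test'] = df['Price'].eq(0) & first_min.eq(df.index)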

Related

How to vectorize a pandas dataframe calculation where if a conditional is not met the data from the previous row is entered?

Currently I am using a for loop with a conditional that when true performs a calculation and inputs it into the column of a dataframe. However, when the conditional is not met the data from the previous row is entered into the new row.
This is a pseudocode of what I currently have:
for index in range(len(dataframe[column1])):
    if condition == True:
        dataframe.at[index, column3] = dataframe.at[index, column1] - dataframe.at[index, column2]
    else:
        dataframe.at[index, column3] = dataframe.at[index - 1, column3]
I understand that when the calculation of the current row depends on the previous row, vectorization usually is not viable. However in this case, since the calculation for column 3 does not depend on the previous row and I am simply inputting the previous row's value into the current row, would it be possible to vectorize this to improve runtime speed?
You could do that in a vectorized way like this.
Starting Data
c0 c1 c2
0 5 2 4
1 5 10 6
2 9 3 2
3 1 4 2
4 4 2 7
5 1 5 8
6 3 4 6
7 10 1 3
8 4 2 6
9 3 1 2
Execute
import numpy as np
dfc = df.assign(c3=np.where(df['c0']>2, df['c1']-df['c2'], np.nan)).ffill().fillna(0).astype(int)
print(dfc)
Result
c0 c1 c2 c3
0 5 2 4 -2
1 5 10 6 4
2 9 3 2 1
3 1 4 2 1
4 4 2 7 -5
5 1 5 8 -5
6 3 4 6 -2
7 10 1 3 -2
8 4 2 6 -4
9 3 1 2 -1
This leverages NumPy's where function to do the selection. If the condition is true, it does the subtraction; if not, it temporarily places a NaN in the cell. Then ffill does a forward fill, which completes the logic of placing the previous value of the column into a cell when the condition is not true. Note that fillna(0) places a zero in the first row if the condition is not met there, since that row has no previous value to fill from.
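Mapped onto the names from the question's pseudocode, the same pattern looks roughly like this (column1, column2 and column3 are the placeholders from the question, and the comparison standing in for the condition is only illustrative):
cond = dataframe[column1] > dataframe[column2]        # stand-in for the real condition
dataframe[column3] = np.where(cond, dataframe[column1] - dataframe[column2], np.nan)
dataframe[column3] = dataframe[column3].ffill().fillna(0)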

Pandas Constant Values after each Zero Value

Say I have the following dataframe:
values
0 4
1 0
2 2
3 3
4 0
5 8
6 5
7 1
8 0
9 4
10 7
I want to find a pandas vectorized function (preferably using groupby) that would replace all nonzero values with the first nonzero value in that chunk of nonzero values, i.e. something that would give me
values new
0 4 4
1 0 0
2 2 2
3 3 2
4 0 0
5 8 8
6 5 8
7 1 8
8 0 0
9 4 4
10 7 4
Is there a good way of achieving this?
Make a boolean mask that selects the rows with zero and the rows immediately after them, use this mask with where to replace the remaining values with NaN, then forward fill to propagate the values.
m = df['values'].eq(0)
# keep the zeros and the rows right after them, blank everything else to NaN, then forward fill;
# fillna restores the original values for the rows before the first zero
df['new'] = df['values'].where(m | m.shift()).ffill().fillna(df['values'])
Result
print(df)
values new
0 4 4.0
1 0 0.0
2 2 2.0
3 3 2.0
4 0 0.0
5 8 8.0
6 5 8.0
7 1 8.0
8 0 0.0
9 4 4.0
10 7 4.0
Get rows for zeros, and the rows immediately after:
zeros = df.index[df['values'].eq(0)]
after_zeros = zeros.union(zeros +1)
Get the rows that need to be forward filled:
replace = df.index.difference(after_zeros)
replace = replace[replace > zeros[0]]
Set values and forward fill on replace:
import numpy as np

df['new'] = df['values']
df.loc[replace, 'new'] = np.nan
df.ffill()
values new
0 4 4.0
1 0 0.0
2 2 2.0
3 3 2.0
4 0 0.0
5 8 8.0
6 5 8.0
7 1 8.0
8 0 0.0
9 4 4.0
10 7 4.0
The following function should do the job for you. Check the comments in the function to understand the workflow of the solution.
import pandas as pd

def ffill_nonZeros(values):
    # get the values that are not equal to 0
    non_zero = values[values != 0]
    # get their indexes
    non_zero_idx = non_zero.index.to_series()
    # find where indexes are consecutive
    diff = non_zero_idx.diff()
    mask = diff == 1
    # using the mask, set all places in non_zero where the change is consecutive to None
    non_zero[mask] = None
    # fill forward (replace all None values with the previous valid value)
    new_non_zero = non_zero.ffill()
    # put the new values back at their indexes
    new = values.copy()
    new[new_non_zero.index] = new_non_zero
    return new
Now applying this function to your data:
df = pd.DataFrame([4, 0, 2, 3, 0, 8, 5, 1, 0, 4, 7], columns=['values'])
df['new'] = ffill_nonZeros(df['values'])
print(df)
Output:
values new
0 4 4
1 0 0
2 2 2
3 3 2
4 0 0
5 8 8
6 5 8
7 1 8
8 0 0
9 4 4
10 7 4
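Since the question asks for something groupby-based, here is one possible sketch (assuming the same df with a 'values' column): zeros split the column into chunks via a cumulative sum of the zero mask, and each chunk's first non-zero value is broadcast back with transform:
m = df['values'].eq(0)
chunk = m.cumsum()                                       # a new chunk starts at every zero
first_nonzero = df['values'].mask(m).groupby(chunk).transform('first')
df['new'] = first_nonzero.mask(m, 0)                     # zeros stay 0, other rows get their chunk's first non-zero value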

Assign the frequency of each value to the dataframe as a new column

I am trying to set up a DataFrame that contains a column called Frequency.
This column should show, in every row, how often that row's value is present in a specific column of the dataframe. Something like this:
Index Category Frequency
0 1 1
1 3 2
2 3 2
3 4 1
4 7 3
5 7 3
6 7 3
7 8 1
This is just an example.
I already tried it with value_counts(); however, I only get a value in the last row of each repeated number.
In the case of the example:
Index Category Frequency
0 1 1
1 3 N.A
2 3 2
3 4 1
4 7 N.A
5 7 N.A
6 7 3
7 8 1
It is very important that the column has the same number of rows as the dataframe, and it should preferably be appended to the same dataframe.
df['Frequency'] = df.groupby('Category')['Category'].transform('count')
Use pandas.Series.map:
df['Frequency'] = df['Category'].map(df['Category'].value_counts())
or pandas.Series.replace:
df['Frequency'] = df['Category'].replace(df['Category'].value_counts())
Output:
Index Category Frequency
0 0 1 1
1 1 3 2
2 2 3 2
3 3 4 1
4 4 7 3
5 5 7 3
6 6 7 3
7 7 8 1
Details
df['Category'].value_counts()
7 3
3 2
4 1
1 1
8 1
Name: Category, dtype: int64
Using value_counts you get a Series whose index contains the category values and whose values are the counts. So you can use map or pandas.Series.replace to create a Series with the category values replaced by their counts, and finally assign this Series to the frequency column.
You can do it using groupby like below:
df.groupby("Category") \
  .apply(lambda g: g.assign(frequency = len(g))) \
  .reset_index(level=0, drop=True)

How can I assign other values for corresponding values in a dataframe column

I have a pandas dataframe column that contains different numbers, and each number has a different frequency. There are 532 unique values in the column, and 59,000 values in total.
0 135715
1 138775
2 134915
3 134335
4 134555
5 144995
6 136515
7 135185
8 145555
9 135245
...
How can I replace these values with corresponding values that run from 1 to 532? Something like this:
0 1
1 2
2 3
3 4
4 5
5 5
6 5
7 6
8 7
9 1
10 1
11 4
...
I tried np.where() with np.arange(), but it raises an error.
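A hedged sketch of one way to get this (assuming the numbers live in a column named 'value'; the column and result names here are placeholders): pd.factorize assigns each distinct number an integer code in order of first appearance, so repeated numbers get the same code, and adding 1 makes the codes run from 1 to 532:
codes, uniques = pd.factorize(df['value'])
df['value_id'] = codes + 1        # same number -> same id, ids start at 1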

Create a new column and assign a value for each group using groupby

I want to create a new column called 'fold' and assign values to it depending on the group of quote_id. Let's say if 3 quote_id values are the same it should assign 1, and if the next 4 quote_id values are the same it should assign 2.
In short, it should assign a number to each group of quote_id.
I have been trying for a long time but I am not getting the expected results.
i = 1
def func(x):
    x['fold'] = i
    return x
in_df.groupby('quote_id').apply(func)
i = i + 1
My output should look like below.
quote_id fold
1300079-DE 1
1300079-DE 1
1300079-DE 1
1300185-DE 2
1300560-DE 3
1301011-DE 4
1301011-DE 4
1301011-DE 4
1301644-DE 5
1301907-DE 6
1301907-DE 6
1301907-DE 6
Call rank with method='dense':
In [10]:
df['fold'] = df['quote_id'].rank(method='dense')
df
Out[10]:
quote_id fold
0 1300079-DE 1
1 1300079-DE 1
2 1300079-DE 1
3 1300185-DE 2
4 1300560-DE 3
5 1301011-DE 4
6 1301011-DE 4
7 1301011-DE 4
8 1301644-DE 5
9 1301907-DE 6
10 1301907-DE 6
11 1301907-DE 6
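An equivalent sketch with groupby: ngroup numbers each group of identical quote_id values starting from 0, in sorted key order by default, which matches this already-sorted data; adding 1 reproduces the fold column:
df['fold'] = df.groupby('quote_id').ngroup() + 1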
