I'm trying to make a counter that changes value only when the Flag is different from the previous row's, or when the ID I'm grouping by changes.
Let's say I have the following dataframe:
ID Flag New_Column
A NaN 1
A 0 1
A 0 1
A 0 1
A 1 2
A 1 2
A 1 2
A 0 3
A 0 3
A 0 3
A 1 4
A 1 4
A 1 4
B NaN 1
B 0 1
I want to create New_Column so that every time the Flag value changes, New_Column is incremented by one, and if the ID changes, it resets to one and starts over.
Here is what I tried using np.select, but it's not working:
df['New_Column'] = None
df['Flag_Lag'] = df.sort_values(by=['ID', 'Date_Time'], ascending=True).groupby(['ID'])['Flag'].shift(1)
df['ID_Lag'] = df.sort_values(by=['ID', 'Date_Time'], ascending=True).groupby(['ID'])['ID'].shift(1)
conditions = [((df['Flag'] != df['Flag_Lag']) & (df['ID'] == df['ID_Lag'])),
((df['Flag'] == df['Flag_Lag']) & (df['ID'] == df['ID_Lag'])),
((df['Flag_Lag'] == np.nan) & (df['New_Column'].shift(1) == 1)),
((df['ID'] != df['ID_Lag']))
]
choices = [(df['New_Column'].shift(1) + 1),
(df['New_Column'].shift(1)),
(df['New_Column'].shift(1)),
1]
df['New_Column'] = np.select(conditions, choices, default=np.nan)
With this code, the first value of New_Column is 1, the second is NaN, and the rest are None.
Does anyone know a better way to do this?
Group by ID and use the cumulative sum of (current value is not equal to the previous value):
df['new'] = df.groupby('ID') \
    .apply(lambda x: x['Flag'].fillna(0).diff().ne(0).cumsum()).reset_index(level=0, drop=True)
ID Flag New_Column new
0 A NaN 1 1
1 A 0.0 1 1
2 A 0.0 1 1
3 A 0.0 1 1
4 A 1.0 2 2
5 A 1.0 2 2
6 A 1.0 2 2
7 A 0.0 3 3
8 A 0.0 3 3
9 A 0.0 3 3
10 A 1.0 4 4
11 A 1.0 4 4
12 A 1.0 4 4
13 B NaN 1 1
14 B 0.0 1 1
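The same idea can also be written with groupby().transform, which keeps the result aligned with the original index and avoids the reset_index step (a small variation on the answer above, not part of it):
df['new'] = df.groupby('ID')['Flag'].transform(lambda s: s.fillna(0).diff().ne(0).cumsum())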
If speed is not a concern and you want some easy-to-read code, you could simply iterate over the dataframe and run a simple function for each row.
def f(row):
    global previous_ID, previous_flag, previous_count
    if previous_ID is None:  # let's start the count
        count = 1
    elif previous_ID != row['ID']:  # let's start the count over
        count = 1
    elif previous_flag == row['Flag']:  # same ID, same Flag
        count = previous_count
    else:  # same ID, different Flag
        count = previous_count + 1
    previous_ID = row['ID']
    previous_flag = row['Flag']
    previous_count = count
    return count
You should probably fill your NaN values with 0 first, or add a special case for them in the function.
You can run the function in the following way:
previous_ID, previous_flag, previous_count = None, None, None
new_values = []
for i, row in df.iterrows():
    new_values.append(f(row))
df['New_Column'] = new_values
And that's it.
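If you'd rather avoid the globals, roughly the same logic can be written as a plain loop over itertuples, which is also faster than iterrows (a sketch, assuming the NaN values in Flag have already been filled with 0):
counts = []
prev_id, prev_flag, prev_count = None, None, 0
for row in df.itertuples(index=False):
    if prev_id != row.ID:          # new ID: restart the counter
        prev_count = 1
    elif prev_flag != row.Flag:    # same ID, Flag changed: increment
        prev_count += 1
    counts.append(prev_count)
    prev_id, prev_flag = row.ID, row.Flag
df['New_Column'] = counts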
Related
I would like to sum values of a dataframe conditionally, based on the values of a different dataframe. Say for example I have two dataframes:
df1 = pd.DataFrame(data = [[1,-1,5],[2,1,1],[3,0,0]],index=[0,1,2],columns = [0,1,2])
index 0 1 2
-----------------
0 1 -1 5
1 2 1 1
2 3 0 0
df2 = pd.DataFrame(data = [[1,1,3],[1,1,2],[0,2,1]],index=[0,1,2],columns = [0,1,2])
index 0 1 2
-----------------
0 1 1 3
1 1 1 2
2 0 2 1
What I would like is, for example, if a value in df1 equals 1, to sum the values at the same locations in df2.
In this example, if the condition is 1, then the sum of df2 would be 4. If the condition was 0, the result would be 3.
Another option with Pandas' query:
df2.query("@df1==1").sum().sum()
# 4
You can use a mask with where:
df2.where(df1.eq(1)).sum().sum()
# or
# np.nansum(df2.where(df1.eq(1)).to_numpy())
output: 4.0
intermediate:
df2.where(df1.eq(1))
0 1 2
0 1.0 NaN NaN
1 NaN 1.0 2.0
2 NaN NaN NaN
Assuming that one wants to store the result in a variable value, there are various options to achieve that. Two of them are shown below.
Option 1
One can simply do the following
value = df2[df1 == 1].sum().sum()
[Out]: 4.0 # numpy.float64
# or
value = sum(df2[df1 == 1].sum())
[Out]: 4.0 # float
Option 2
Using pandas.DataFrame.where
value = df2.where(df1 == 1, 0).sum().sum()
[Out]: 4.0 # numpy.int64
# or
value = sum(df2.where(df1 == 1, 0).sum())
[Out]: 4 # int
Notes:
df2[df1 == 1] gives the output below; df2.where(df1 == 1, 0) gives the same result except that the NaN values are replaced by 0.
0 1 2
0 1.0 NaN NaN
1 NaN 1.0 2.0
2 NaN NaN NaN
Depending on the desired output (float, int, numpy.float64,...) one method might be better than the other.
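If a built-in Python number is needed rather than a NumPy scalar, one further variation (not listed in the options above) is an explicit cast:
value = float(df2[df1 == 1].sum().sum())         # 4.0 as a plain Python float
value = int(df2.where(df1 == 1, 0).sum().sum())  # 4 as a plain Python int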
Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For column A, I need to know how many of the following and preceding consecutive rows have an A value greater than the current row's A value but less than their own B value (counting stops at the first row that fails the condition).
So my expected output is :
A B next count previous count
0 9 2 0
3 8 0 0
2 3 0 1
9 5 0 0
1 5 0 0
0 5 2 1
4 5 1 0
7 8 0 0
3 0 0 2
2 4 0 0
Explanation:
First row: next count is 2, since the next A values 3 and 2 are greater than 0 and less than their corresponding B values 8 and 3.
Second row: next count is 0, since the next A value 2 is not greater than 3.
Third row: next count is 0, since the next A value 9 is greater than 2 but not less than its corresponding B value 5.
The previous count is calculated the same way, scanning backwards.
Note: I know how to solve this problem by looping with a list comprehension or with the pandas apply method, but I wouldn't mind a clearer and more concise apply approach. I was mainly looking for a more idiomatic, vectorized pandas approach.
My Solution
Here is my apply solution, which I think is inefficient. As people have said, there might be no vectorized solution for this question, so a more efficient apply-based solution would also be accepted.
This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition.
def get_prev_next_count(row):
next_nrow = df.loc[row['index']+1:,['A', 'B']]
prev_nrow = df.loc[:row['index']-1,['A', 'B']][::-1]
if (next_nrow.size == 0):
return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
if (prev_nrow.size == 0):
return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
Generating output :
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
Output :
This gives us the expected output
df
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
I made some optimizations:
You don't need to reset_index(); you can access the index with .name.
If you only pass df[['A']] instead of the whole frame, that may help.
prev_nrow.empty is the same as (prev_nrow.size == 0)
I applied different logic to get the desired value via first_false, which speeds things up significantly.
def first_false(val1, val2, A):
i = 0
for x, y in zip(val1, val2):
if A < x < y:
i += 1
else:
break
return i
def get_prev_next_count(row):
A = row['A']
next_nrow = df.loc[row.name+1:,['A', 'B']]
prev_nrow = df2.loc[row.name-1:,['A', 'B']]
if next_nrow.empty:
return 0, first_false(prev_nrow.A, prev_nrow.B, A)
if prev_nrow.empty:
return first_false(next_nrow.A, next_nrow.B, A), 0
return (first_false(next_nrow.A, next_nrow.B, A),
first_false(prev_nrow.A, prev_nrow.B, A))
df2 = df[::-1].copy() # Shave a tiny bit of time by only reversing it once~
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
Output:
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
Timing
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
Gave up at 15 minutes.
New Method:
1m 20sec
Throw pandarallel at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
26sec
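For completeness, the parallel result can be assigned back to the frame the same way as the regular apply above:
df[['next count', 'previous count']] = df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')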
I am trying to add a new column to my dataframe that depends on values that may or may not exist in previous rows. My dataframe looks like this:
index id timestamp sequence_index value prev_seq_index
0 10 1 0 5 0
1 10 1 1 1 2
2 10 1 2 2 0
3 10 2 0 9 0
4 10 2 1 10 1
5 10 2 2 3 1
6 11 2 0 42 1
7 11 2 1 13 0
Note: there is no relation between index and sequence_index, index is just a counter.
What I want to do is add a column prev_value that holds the value of the most recent earlier row with the same id and with sequence_index equal to the current row's prev_seq_index. If no such previous row exists, a default value should be used; for the purpose of this question I will use -1.
index id timestamp sequence_index value prev_seq_index prev_value
0 10 1 0 5 0 -1
1 10 1 1 1 2 -1
2 10 1 2 2 0 -1
3 10 2 0 9 0 5 # value from df[index == 0]
4 10 2 1 10 1 1 # value from df[index == 1]
5 10 2 2 3 1 1 # value from df[index == 1]
6 11 2 0 42 1 -1
7 11 2 1 13 0 -1
My current solution is brute force and very slow, and I was wondering if there is a faster way:
prev_values = np.zeros(len(df))
i = 0
for index, row in df.iterrows():
# filter for previous rows with the same id and desired sequence index
tmp_df = df[(df.id == row.id) & (df.timestamp < row.timestamp) \
& (df.sequence_index == row.prev_seq_index)]
if (len(tmp_df) > 0):
# get value from the most recent row
prev_value = tmp_df.loc[tmp_df.timestamp.idxmax(), 'value']
else:
prev_value = -1
prev_values[i] = prev_value
i += 1
df['prev_value'] = prev_values
I would suggest tackling this via a left join. First, however, you'll need to make sure that your data doesn't have duplicates: create a dataframe of the most recent timestamps and grab their values.
agg = df.groupby(['sequence_index'], as_index=False).agg({'timestamp': 'max'})
agg = pd.merge(agg, df[['timestamp', 'sequence_index', 'value']], how='inner', on=['timestamp', 'sequence_index'])
agg.rename(columns={'value': 'prev_value'}, inplace=True)
now you can join the data back on itself
df=pd.merge(df,agg,how='left',left_on='prev_seq_index',right_on='sequence_index')
now you can deal with the NaN values
df.prev_value=df.prev_value.fillna(-1)
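If the lookup also needs to respect id and only consider strictly earlier timestamps, as in the expected output, a rough vectorized sketch along the same join-based lines (an illustration using the question's column names plus a helper row_id column, not the exact answer above) could be:
# candidate previous rows: same id, sequence_index equal to this row's prev_seq_index
right = df[['id', 'sequence_index', 'timestamp', 'value']].rename(
    columns={'sequence_index': 'prev_seq_index', 'timestamp': 'prev_timestamp', 'value': 'prev_value'})
cand = df.rename_axis('row_id').reset_index().merge(right, on=['id', 'prev_seq_index'])
# keep only strictly earlier timestamps, then take the most recent candidate per row
cand = cand[cand['prev_timestamp'] < cand['timestamp']]
best = cand.sort_values('prev_timestamp').groupby('row_id')['prev_value'].last()
df['prev_value'] = df.index.to_series().map(best).fillna(-1)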
I have a dataframe column with 1s and 0s like this:
df['working'] =
1
1
0
0
0
1
1
0
0
1
which represents when a machine is working (1) or stopped (0). I need to classify these stops depending on their length, i.e. if there are n or fewer consecutive 0s, change all of them to short-stop (2); if there are more than n, to long-stop (3). The expected result should look like this when applied to the example with n=2:
df[['working', 'result']]=
1 1
1 1
0 3
0 3
0 3
1 1
1 1
0 2
0 2
1 1
Of course this is just an example; my real df has more than 1M rows.
I tried looping through it, but it's really slow, and I also tried adapting an existing answer, but I couldn't transform it to fit my problem.
Can anyone help? Thanks so much in advance.
Series.map with Series.value_counts should improve performance here:
n = 2
#compare 0 values
m = df['working'].eq(0)
#created groups only by mask
s = df['working'].cumsum()[m]
#counts only 0 groups
out = s.map(s.value_counts())
#set new values by mask
df['result'] = 1
df.loc[m, 'result'] = np.where(out > n, 3, 2)
print (df)
working result
0 1 1
1 1 1
2 0 3
3 0 3
4 0 3
5 1 1
6 1 1
7 0 2
8 0 2
9 1 1
Here's one approach:
# Counter for each group where there is a change
m = df.working.ne(df.working.shift()).cumsum()
# mask where working is 0
eq0 = df.working.eq(0)
# Get a count of consecutive 0s
count = df[eq0].groupby(m[eq0]).transform('count')
# replace 0s accordingly
df.loc[eq0, 'result'] = np.where(count > 2, 3, 2).ravel()
# fill the remaining values with 1
df['result'] = df.result.fillna(1)
print(df)
working result
0 1 1.0
1 1 1.0
2 0 3.0
3 0 3.0
4 0 3.0
5 1 1.0
6 1 1.0
7 0 2.0
8 0 2.0
9 1 1.0
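For reference, the same idea can be parameterized on n and written with transform('size'), combining the two answers above (a sketch, not from the original posts; assumes pandas/NumPy imported as pd/np as elsewhere in the thread):
n = 2
# label each run of consecutive equal values, then get each run's length
grp = df['working'].ne(df['working'].shift()).cumsum()
run_len = df.groupby(grp)['working'].transform('size')
# working rows stay 1; stopped rows become 2 (short stop) or 3 (long stop)
df['result'] = np.where(df['working'].eq(1), 1, np.where(run_len > n, 3, 2))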
I have a dataset that looks something like this:
index Ind. Code Code_2
1 1 NaN x
2 0 7 NaN
3 1 9 z
4 1 NaN a
5 0 11 NaN
6 1 4 NaN
I also created a list to indicate values in the column Code, something like this:
Code_List=['7', '9', '11']
I would like to create a new column for the indicator that is 1 as long as Ind. = 1, Code is in the above list, and Code_2 is not null.
I would like to do this with a function containing an if statement. I tried the following and am not sure if it's a syntax issue, but I keep getting attribute errors such as these:
def New_Indicator(x):
if x['Ind.'] == 1 and (x['Code'].isin[Code_List]) or (x['Code_2'].notnull()):
return 1
else:
return 0
df['NewIndColumn'] = df.apply(lambda x: New_Indicator(x), axis=1)
("'str' object has no attribute 'isin'", 'occurred at index 259')
("'float' object has no attribute 'notnull'", 'occurred at index 259')
The problem is that in your function, x['Code'] is a scalar (a string or a float), not a Series. I suggest you use numpy.where instead:
ind1 = df['Ind.'].eq(1)
codes = df.Code.isin(Code_List)
code2NotNull = df.Code_2.notnull()
mask = ind1 & codes & code2NotNull
df['indicator'] = np.where(mask, 1, 0)
print(df)
Output
index Ind. Code Code_2 indicator
0 1 1 NaN x 0
1 2 0 7.0 NaN 0
2 3 1 9.0 z 1
3 4 1 NaN a 0
4 5 0 11.0 NaN 0
5 6 1 4.0 NaN 0
Update (as suggested by @splash58):
df['indicator'] = mask.astype(int)
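If you do want to keep a row-wise function as in the question, a sketch of a version that works on scalars (it assumes Code is stored as strings so that membership in Code_List matches, and that all three conditions should be ANDed, as described):
def New_Indicator(x):
    # x['Code'] and x['Code_2'] are scalars here, so use `in` and pd.notnull
    if x['Ind.'] == 1 and x['Code'] in Code_List and pd.notnull(x['Code_2']):
        return 1
    return 0

df['NewIndColumn'] = df.apply(New_Indicator, axis=1)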