I am trying to add a new column to my dataframe that depends on values that may or may not exist in previous rows. My dataframe looks like this:
index id timestamp sequence_index value prev_seq_index
0 10 1 0 5 0
1 10 1 1 1 2
2 10 1 2 2 0
3 10 2 0 9 0
4 10 2 1 10 1
5 10 2 2 3 1
6 11 2 0 42 1
7 11 2 1 13 0
Note: there is no relation between index and sequence_index, index is just a counter.
What I want to do is add a column prev_value that holds the value of the most recent earlier row with the same id and with sequence_index == prev_seq_index. If no such previous row exists, a default value should be used; for the purpose of this question I will use a default value of -1.
index id timestamp sequence_index value prev_seq_index prev_value
0 10 1 0 5 0 -1
1 10 1 1 1 2 -1
2 10 1 2 2 0 -1
3 10 2 0 9 0 5 # value from df[index == 0]
4 10 2 1 10 1 1 # value from df[index == 1]
5 10 2 2 3 1 1 # value from df[index == 1]
6 11 2 0 42 1 -1
7 11 2 1 13 0 -1
My current solution is a brute force which is very slow, and I was wondering if there was a faster way:
import numpy as np

prev_values = np.zeros(len(df))
i = 0
for index, row in df.iterrows():
    # filter for previous rows with the same id and desired sequence index
    tmp_df = df[(df.id == row.id) & (df.timestamp < row.timestamp)
                & (df.sequence_index == row.prev_seq_index)]
    if len(tmp_df) > 0:
        # get the value from the most recent matching row
        prev_value = tmp_df.loc[tmp_df.timestamp.idxmax(), 'value']
    else:
        prev_value = -1
    prev_values[i] = prev_value
    i += 1
df['prev_value'] = prev_values
I would suggest tackling this via a left join. First, however, you'll need to make sure that your data doesn't have duplicates: create a dataframe of the most recent timestamp per id and sequence_index and grab the corresponding values.
agg = df.groupby(['id', 'sequence_index'], as_index=False).agg({'timestamp': 'max'})
agg = pd.merge(agg, df[['id', 'timestamp', 'sequence_index', 'value']],
               how='inner', on=['id', 'timestamp', 'sequence_index'])
agg.rename(columns={'value': 'prev_value'}, inplace=True)
Now you can join the data back onto itself:
df = pd.merge(df, agg, how='left',
              left_on=['id', 'prev_seq_index'], right_on=['id', 'sequence_index'],
              suffixes=('', '_prev'))
Now you can deal with the NaN values:
df.prev_value = df.prev_value.fillna(-1)
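If you need strictly the most recent earlier row rather than the overall latest timestamp, pd.merge_asof is another option. Below is only a sketch of that alternative (not the answer above); it renames the lookup columns so the join keys line up and assumes the integer timestamp defines recency:
import pandas as pd

# one row per candidate "previous" value, keyed the same way as the left side
lookup = (df[['id', 'sequence_index', 'timestamp', 'value']]
          .rename(columns={'sequence_index': 'prev_seq_index', 'value': 'prev_value'})
          .sort_values('timestamp'))

out = pd.merge_asof(
    df.sort_values('timestamp'), lookup,
    on='timestamp',
    by=['id', 'prev_seq_index'],
    direction='backward',        # take the most recent earlier timestamp
    allow_exact_matches=False,   # strictly earlier, mirroring df.timestamp < row.timestamp
)
out['prev_value'] = out['prev_value'].fillna(-1)
# note: the result is ordered by timestamp; keep the original index as a column
# beforehand if you need to restore the original row order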
Related
Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For column A I need to know how many of the following and preceding consecutive rows have an A value greater than the current row's A value but less than their own B value (counting stops at the first row that fails).
So my expected output is :
A B next count previous count
0 9 2 0
3 8 0 0
2 3 0 1
9 5 0 0
1 5 0 0
0 5 2 1
4 5 1 0
7 8 0 0
3 0 0 2
2 4 0 0
Explanation:
First row (A=0, B=9): next count is 2, since the next two values 3 and 2 are greater than 0 and less than their corresponding B values 8 and 3.
Second row (A=3, B=8): next count is 0, since the next value 2 is not greater than 3.
Third row (A=2, B=3): next count is 0, since the next value 9 is greater than 2 but not less than its corresponding B value 5.
The previous count is calculated in the same way, walking backwards.
Note: I know how to solve this problem by looping, using a list comprehension or the pandas apply method, but I wouldn't mind a clear and concise apply approach. I was looking for a more idiomatic, vectorized pandas approach.
My Solution
Here is my apply solution, which I think is inefficient. Also, as people have said, there might be no vectorized solution for this question, so as mentioned, a more efficient apply solution will be accepted.
This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition.
def get_prev_next_count(row):
    next_nrow = df.loc[row['index']+1:, ['A', 'B']]
    prev_nrow = df.loc[:row['index']-1, ['A', 'B']][::-1]
    if next_nrow.size == 0:
        return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
    if prev_nrow.size == 0:
        return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
    return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(),
            ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
Generating output :
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
Output :
This gives us the expected output
df
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
I made some optimizations:
You don't need reset_index(); you can access the index with .name
If you only pass df[['A']] instead of the whole frame, that may help.
prev_nrow.empty is the same as (prev_nrow.size == 0)
I applied different logic to get the desired value via first_false; this speeds things up significantly.
def first_false(val1, val2, A):
    i = 0
    for x, y in zip(val1, val2):
        if A < x < y:
            i += 1
        else:
            break
    return i

def get_prev_next_count(row):
    A = row['A']
    next_nrow = df.loc[row.name+1:, ['A', 'B']]
    prev_nrow = df2.loc[row.name-1:, ['A', 'B']]
    if next_nrow.empty:
        return 0, first_false(prev_nrow.A, prev_nrow.B, A)
    if prev_nrow.empty:
        return first_false(next_nrow.A, next_nrow.B, A), 0
    return (first_false(next_nrow.A, next_nrow.B, A),
            first_false(prev_nrow.A, prev_nrow.B, A))

df2 = df[::-1].copy()  # Shave a tiny bit of time by only reversing it once~
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
Output:
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
Timing
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
Gave up at 15 minutes.
New Method:
1m 20sec
Throw pandarallel at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
26sec
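If pandarallel is still too slow, a compiled loop is another option. Below is a minimal sketch using numba (not part of the answer above, and it assumes numba is installed); it mirrors the first_false logic on plain NumPy arrays:
import numba
import numpy as np

@numba.njit
def consecutive_counts(A, B):
    n = len(A)
    nxt = np.zeros(n, dtype=np.int64)
    prv = np.zeros(n, dtype=np.int64)
    for i in range(n):
        for j in range(i + 1, n):        # walk forward until the condition first fails
            if A[i] < A[j] < B[j]:
                nxt[i] += 1
            else:
                break
        for j in range(i - 1, -1, -1):   # walk backward until the condition first fails
            if A[i] < A[j] < B[j]:
                prv[i] += 1
            else:
                break
    return nxt, prv

df['next count'], df['previous count'] = consecutive_counts(df['A'].to_numpy(), df['B'].to_numpy())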
I have a timeseries dataframe with multiple columns, which contain NaNs independently of each other.
I also have a given length "LEN" that every sequence of valid elements should at least have.
(By "sequence" I mean the values collected from the rows leading up to a given index.)
Iterating is extremely time-inefficient, but it would look similar to this:
LEN = 100
maximum_sequence_len = 0

def get_seq_start_where_every_col_has_enough_valids(df, seq_end_ix, LEN):
    # determine the index from which every column contains at least "LEN" valid elements
    first_SEQ_LEN_Sample_start_ix = seq_end_ix
    for col in df.columns:
        col_df = df[col].dropna()
        temp = col_df[col_df.index <= seq_end_ix].index[-LEN]
        if temp < first_SEQ_LEN_Sample_start_ix:
            first_SEQ_LEN_Sample_start_ix = temp
    return first_SEQ_LEN_Sample_start_ix

for i in range(len(df)):  # for every index
    seq_end_ix = df.index[i]
    seq_start_ix = get_seq_start_where_every_col_has_enough_valids(df, seq_end_ix, LEN)
    necessary_len = len(df.loc[seq_start_ix:seq_end_ix])
    if maximum_sequence_len < necessary_len:
        maximum_sequence_len = necessary_len
An Example:
LEN = 6 # in this example every column has to have at least 6 valid elements in the window of rows before
print(df)
>>>>
A B C D E F
index
0 1 1 1 1 1 1
1 1 1 1 1 1 1
2 1 1 1 1 1 | 1
3 NaN 1 1 NaN 1 | 1
4 NaN 1 1 NaN 1 | 1
5 1 1 1 1 1 | 1
6 1 1 1 1 NaN | 1
7 NaN 1 1 NaN 1 | 1
8 NaN 1 1 1 1 | 1
9 1 1 1 1 NaN | 1
10 1 1 1 1 NaN | 1
11 1 1 1 NaN NaN | 1
12 1 1 1 1 NaN | 1
13 1 1 1 1 NaN | 1
14 1 NaN 1 1 NaN |* 1
16 1 1 1 1 1 NaN
17 NaN 1 1 1 1 1
18 NaN 1 1 1 1 NaN
19 1 1 1 1 1 1
# ==> Result: 13
# * here, the longest window needed to get at least 6 valid values in EVERY column has a length of 13 (the rows marked with |). Note that if the other columns contained more NaNs in the marked rows, the window would have to be longer than 13.
The problem is that I want to create sequence samples, but I don't know how long they have to be so that each sample has at least "LEN" valid elements in every column.
Essentially, you need to maintain a vector of counters, one counter per column.
The vector signals 'window ready' once all counters are at least LEN (6 in the example). If a window (start_index, end_index) is ready, you can emit all rows in the window, reset the window's start_index and end_index to the following row, and reset all counters to zero.
Repeat until the end of the data.
Algorithm get_windows(data[][], LEN)
    counters: array of integers of length = data.cols, values initialized to 0
Begin
    window_start_index = 0
    window_end_index = 0
    for each row in data
        for each col in row
            if (value(col) != NaN)
                counters[index(col)]++
            end if
        next // col
        // check if this row leaves the window still incomplete
        continue_flag = false
        for each counter in counters
            if (counter < LEN)
                continue_flag = true
                exit for loop
            end if
        next // counter
        if (continue_flag)
            window_end_index++
        else
            // we have a window (window_start_index, window_end_index)
            // both inclusive
            // do something with the window
            // start the next window at the following row
            window_start_index = window_end_index + 1
            window_end_index = window_start_index
            // reset counters
            for each counter in counters
                counter = 0
            next
        end if
    next // row
End Algorithm
Is this single pass algorithm what you need?
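A minimal pandas/NumPy sketch of that algorithm (the function and variable names are mine, not from the question; it returns windows as pairs of positional row numbers, both inclusive):
import numpy as np
import pandas as pd

def get_windows(df, LEN):
    counters = np.zeros(df.shape[1], dtype=int)
    windows = []
    window_start = 0
    valid = df.notna().to_numpy()              # True where a cell holds a valid value
    for i, row in enumerate(valid):
        counters += row                        # count valid cells per column
        if (counters >= LEN).all():            # every column has at least LEN valid cells
            windows.append((window_start, i))  # emit the window (both ends inclusive)
            window_start = i + 1               # start the next window at the following row
            counters[:] = 0
    return windows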
I have a DataFrame like this:
subject trial attended
0 1 1 1
1 1 3 0
2 1 4 1
3 1 7 0
4 1 8 1
5 2 1 1
6 2 2 1
7 2 6 1
8 2 8 0
9 2 9 1
10 2 11 1
11 2 12 1
12 2 13 1
13 2 14 1
14 2 15 1
I would like to group by subject and then iterate over the rows of each group.
If 'attended' == 1 for a row, a variable sum_reactive should be increased by 1.
When sum_reactive reaches 4, the 'subject' and the 'trial' in which it reached 4 should be added to a dictionary.
I was trying to define a function for this, but it doesn't work:
def count_attended():
    sum_reactive = 0
    dict_attended = {}
    for i, g in reactive.groupby(['subject']):
        for row in g:
            if g['attended'][row] == 1:
                sum_reactive += 1
            if sum_reactive == 4:
                dict_attended.update({g['subject'] : g['trial'][row]})
                return dict_attended
    return dict_attended
I think I'm not clear on how to iterate inside each GroupBy group. I'm quite new to pandas.
IIUC try,
df = df.query('attended == 1')
df.loc[df.groupby('subject')['attended'].cumsum() == 4, ['subject', 'trial']].to_dict(orient='records')
Output:
[{'subject': 2, 'trial': 9}]
Using groupby with cumsum does the counting of attended rows; then check where this running count equals 4 to create a boolean series. You can use this boolean series for boolean indexing to filter your dataframe to those rows. Lastly, select subject and trial with loc and column filtering.
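Spelled out step by step, the same chain looks like this (the intermediate names are mine, added only for readability):
attended = df.query('attended == 1')                        # keep only attended trials
counts = attended.groupby('subject')['attended'].cumsum()   # running count per subject
hit_four = counts == 4                                       # True on the trial where the count reaches 4
result = attended.loc[hit_four, ['subject', 'trial']].to_dict(orient='records')
# [{'subject': 2, 'trial': 9}]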
I would like a formula or anything that acts like a "switch". If the column 'position' goes to 3 or above, the switch is turned on (=1). If 'position' goes above 5, the switch is turned off (=0). And if position goes below 3, the switch is also turned off (=0). I have included the column 'desired' to display what I would like this new column to automate.
df = pd.DataFrame()
df['position'] = [1,2,3,4,5,6,7,8,7,6,5,4,3,2,1,2,3,4,5,4,3,2,1]
df['desired'] = [0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0]
I would use .shift() to create a column with the shifted position, so that the current and previous values are in one row. Then I can check whether the position crosses 3 or 5 from below, or falls below 3, and change the value that will be assigned to the 'desired' column.
After creating the 'desired' column I have to drop the shifted data.
import pandas as pd
df = pd.DataFrame()
df['position'] = [1,2,3,4,5,6,7,8,7,6,5,4,3,2,1,2,3,4,5,4,3,2,1]
#df['desired'] = [0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0]
df['previous'] = df['position'].shift()
# ---
value = 0
def change(row):
    global value
    #print(row)
    if (row['previous'] < 3) and (row['position'] >= 3):
        value = 1
    if (row['previous'] >= 3) and (row['position'] < 3):
        value = 0
    if (row['previous'] <= 5) and (row['position'] > 5):
        value = 0
    return value
# ---
#for ind, row in df.iterrows():
# print(int(row['position']), change(row))
df['desired'] = df.apply(change, axis=1)
df = df.drop('previous', axis=1)
print(df)
Result
position desired
0 1 0
1 2 0
2 3 1
3 4 1
4 5 1
5 6 0
6 7 0
7 8 0
8 7 0
9 6 0
10 5 0
11 4 0
12 3 0
13 2 0
14 1 0
15 2 0
16 3 1
17 4 1
18 5 1
19 4 1
20 3 1
21 2 0
22 1 0
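For comparison, here is a vectorized sketch of the same switch (my own variant, not the apply answer above): mark the rows where the switch flips, then forward-fill that state. The 'off' condition is listed first so that a jump straight past 5 turns the switch off, matching the apply version; 'desired_vec' is just an illustrative column name.
import numpy as np
import pandas as pd

prev = df['position'].shift()
turn_on = (prev < 3) & (df['position'] >= 3)               # crossed up into the 3..5 band
turn_off = (df['position'] > 5) | (df['position'] < 3)     # left the band on either side
state = pd.Series(np.select([turn_off, turn_on], [0, 1], default=np.nan), index=df.index)
df['desired_vec'] = state.ffill().fillna(0).astype(int)    # carry the last switch state forward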
This is how my data looks like:
Day Price A Price B Price C
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 64503 43692 79982
6 86664 69990 53468
7 77924 62998 68911
8 66600 68830 94396
9 82664 89972 49614
10 59741 48904 49528
11 34030 98074 72993
12 74400 85547 37715
13 51031 50031 85345
14 74700 59932 73935
15 62290 98130 88818
I have a small python script that outputs a sum for each column. I need to input an n value (for number of days) and the summing will run and output the values.
However, for example, given n=5 (for days), I want to output only Price A/B/C rows starting from the next day (which is day 6). Hence, the row for Day 5 should be '0'.
How can I produce this logic in Pandas?
The idea I have is to use the input value n to truncate (zero out) the values in the rows corresponding to that day value. But how can I do this in code?
if dataframe['Day'] == n:
    dataframe['Price A'] == 0 & dataframe['Price B'] == 0 & dataframe['Price C'] == 0
You can filter rows by condition and set all columns except the first with iloc[mask, 1:]; to shift the cutoff by one row, add Series.shift:
n = 5
df.iloc[(df['Day'].shift() <= n).values, 1:] = 0
print (df)
Day Price A Price B Price C
0 1 0 0 0
1 2 0 0 0
2 3 0 0 0
3 4 0 0 0
4 5 0 0 0
5 6 0 0 0
6 7 77924 62998 68911
7 8 66600 68830 94396
8 9 82664 89972 49614
9 10 59741 48904 49528
10 11 34030 98074 72993
11 12 74400 85547 37715
12 13 51031 50031 85345
13 14 74700 59932 73935
14 15 62290 98130 88818
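If the end goal is the per-column sum after zeroing, a possible follow-up on the same frame is simply:
totals = df.iloc[:, 1:].sum()   # sums Price A/B/C; the zeroed early days contribute nothing
print(totals)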
Pseudo Code
Make sure to sort by day
Shift the price columns by n and fill in with 0
Sum accordingly
All of that can be done in one line as well; see the sketch below.
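As a sketch of that one-liner (my reading of the pseudo code, not the poster's exact code): shifting the price columns up by n with a fill value of 0 drops the first n days, so the sum then covers only day n+1 onward.
n = 5
price_cols = ['Price A', 'Price B', 'Price C']
totals = (df.sort_values('Day')[price_cols]
            .shift(-n, fill_value=0)   # values for days 1..n fall away, the tail is padded with 0
            .sum())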
It is simply
dataframe.iloc[:n+1] = 0
This sets the values of all columns in the first n+1 rows to 0
# Sample output
dataframe
a b
0 1 2
1 2 3
2 3 4
3 4 2
4 5 3
n = 1
dataframe.iloc[:n+1] = 0
dataframe
a b
0 0 0
1 0 0
2 3 4
3 4 2
4 5 3
This zeroes all of the previous days as well. If you want to zero only the nth row:
dataframe.iloc[n] = 0