Access next, previous, or current row in pandas .loc[] assignment - python

Under the if-then section of the pandas documentation cookbook, we can assign values in one column based on a condition being met in a separate column, using .loc[]:
df = pd.DataFrame({'AAA' : [4,5,6,7],
                   'BBB' : [10,20,30,40],
                   'CCC' : [100,50,-30,-50]})
# AAA BBB CCC
# 0 4 10 100
# 1 5 20 50
# 2 6 30 -30
# 3 7 40 -50
df.loc[df.AAA >= 5,'BBB'] = -1
# AAA BBB CCC
# 0 4 10 100
# 1 5 -1 50
# 2 6 -1 -30
# 3 7 -1 -50
But what if I want to write a condition that involves the previous or subsequent row using .loc[]? For example, say I want to assign df.BBB=5 wherever the difference between the df.CCC of the current row and the df.CCC of the next row is greater than or equal to 50. Then I would like to create a condition that gives me the following data frame:
# AAA BBB CCC
# 0 4 5 100 <-| 100 - 50 = 50, assign df.BBB = 5
# 1 5 5 50 <-| 50 - (-30) = 80, assign df.BBB = 5
# 2 6 -1 -30 <-| -30 - (-50) = 20, don't assign df.BBB = 5
# 3 7 -1 -50 <-| (-50) - 0 = -50, don't assign df.BBB = 5
How can I get this result?
Edit
The answer I'm hoping to find is something like
mask = df['CCC'].current - df['CCC'].next >= 50
df.loc[mask, 'BBB'] = 5
because I'm interested in the general problem of how I can access values above or below the current row being considered in a dataframe (not necessarily just solving this one toy example).
diff() will work on the example I first described, but what of other cases, say, where we want to compare two elements instead of subtracting them?
What if I take the previous data frame and want to find all rows where the current entry in df.BBB matches the next one, and then assign df.CCC based on those comparisons?
if df.BBB.current == df.BBB.next:
    df.CCC = 1
# AAA BBB CCC
# 0 4 5 1 <-| 5 == 5, assign df.CCC = 1
# 1 5 5 50 <-| 5 != -1, do nothing
# 2 6 -1 1 <-| -1 == -1, assign df.CCC = 1
# 3 7 -1 -50 <-| -1 != 0, do nothing
Is there a way to do this with pandas using .loc[]?

Given
>>> df
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
you can compute a boolean mask first via
>>> mask = df['CCC'].diff(-1) >= 50
>>> mask
0 True
1 True
2 False
3 False
Name: CCC, dtype: bool
and then issue
>>> df.loc[mask, 'BBB'] = 5
>>>
>>> df
AAA BBB CCC
0 4 5 100
1 5 5 50
2 6 30 -30
3 7 40 -50
More generally, you can compute a shift
>>> df['CCC_next'] = df['CCC'].shift(-1) # or df['CCC'].shift(-1).fillna(0)
>>> df
AAA BBB CCC CCC_next
0 4 5 100 50.0
1 5 5 50 -30.0
2 6 30 -30 -50.0
3 7 40 -50 NaN
... and then do whatever you want, such as:
>>> df['CCC'].sub(df['CCC_next'], fill_value=0)
0 50.0
1 80.0
2 20.0
3 -50.0
dtype: float64
>>> mask = df['CCC'].sub(df['CCC_next'], fill_value=0) >= 50
>>> mask
0 True
1 True
2 False
3 False
dtype: bool
although for the specific problem in your question the diff approach is sufficient.
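The same shift idea also covers the comparison case from your edit, still assigning through .loc[]. A sketch, starting from the question's frame where BBB is already [5, 5, -1, -1], and treating the missing neighbour of the last row as 0 as your example does:
>>> nxt = df['BBB'].shift(-1).fillna(0)    # next row's BBB; the last row has no neighbour, use 0
>>> df.loc[df['BBB'].eq(nxt), 'CCC'] = 1
This flags rows 0 and 2 and leaves the others untouched, matching the desired frame in the question.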

You can use the enumerate function to access a row and its index simultaneously, so you can obtain the previous and next rows based on the index of the current row. I provide an example script below for your reference:
import pandas as pd

df = pd.DataFrame({'AAA' : [4,5,6,7],
                   'BBB' : [10,20,30,40],
                   'CCC' : [100,50,-30,-50]}, index=['a','b','c','d'])

print('row_pre','row_pre_AAA','row','row_AA','row_next','row_next_AA')
for irow, row in enumerate(df.index):
    if irow == 0:
        row_next = df.index[irow+1]
        print('row_pre', "df.loc[row_pre,'AAA']", row, df.loc[row,'AAA'], row_next, df.loc[row_next,'AAA'])
    elif irow > 0 and irow < df.index.size-1:
        row_pre = df.index[irow-1]
        row_next = df.index[irow+1]
        print(row_pre, df.loc[row_pre,'AAA'], row, df.loc[row,'AAA'], row_next, df.loc[row_next,'AAA'])
    else:
        row_pre = df.index[irow-1]
        print(row_pre, df.loc[row_pre,'AAA'], row, df.loc[row,'AAA'], 'row_next', "df.loc[row_next,'AAA']")
Output as below:
row_pre row_pre_AAA row row_AA row_next row_next_AA
row_pre df.loc[row_pre,'AAA'] a 4 b 5
a 4 b 5 c 6
b 5 c 6 d 7
c 6 d 7 row_next df.loc[row_next,'AAA']
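For reference, the same previous/current/next walk over the df above can be written more compactly by zipping the index against shifted copies of itself; a sketch, padding the missing neighbours with None:
labels = list(df.index)
for prev_lbl, cur_lbl, next_lbl in zip([None] + labels[:-1], labels, labels[1:] + [None]):
    prev_val = df.loc[prev_lbl, 'AAA'] if prev_lbl is not None else None
    next_val = df.loc[next_lbl, 'AAA'] if next_lbl is not None else None
    print(prev_lbl, prev_val, cur_lbl, df.loc[cur_lbl, 'AAA'], next_lbl, next_val)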

Related

Replace a particular column value with 1 and the rest with 0

I have a DataFrame with a column containing values that occur with varying frequency.
I want to convert the value with the highest occurrence to 1 and the rest to 0.
How can I do this using Pandas?
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'availability': np.random.randint(0, 100, 10), 'some_col': np.random.randn(10)})
print(df)
"""
availability some_col
0 9 -0.332662
1 35 0.193257
2 1 2.042402
3 50 -0.298372
4 52 -0.669655
5 3 -1.031884
6 44 -0.763867
7 28 1.093086
8 67 0.723319
9 87 -1.439568
"""
df['availability'] = np.where(df['availability'] == df['availability'].max(), 1, 0)
print(df)
"""
availability some_col
0 0 -0.332662
1 0 0.193257
2 0 2.042402
3 0 -0.298372
4 0 -0.669655
5 0 -1.031884
6 0 -0.763867
7 0 1.093086
8 0 0.723319
9 1 -1.439568
"""
Edit
If you are trying to mask the rows with the values that occur most often instead, try this:
df = pd.DataFrame(
{
'availability': [10, 10, 20, 30, 40, 40, 50, 50, 50, 50],
'some_col': np.random.randn(10)
}
)
print(df)
"""
availability some_col
0 10 0.954199
1 10 0.779256
2 20 -0.438860
3 30 -2.547989
4 40 0.587108
5 40 0.398858
6 50 0.776177 # <--- Most Frequent is 50
7 50 -0.391724 # <--- Most Frequent is 50
8 50 -0.886805 # <--- Most Frequent is 50
9 50 1.989000 # <--- Most Frequent is 50
"""
df['availability'] = np.where(df['availability'].isin(df['availability'].mode()), 1, 0)
print(df)
"""
availability some_col
0 0 0.954199
1 0 0.779256
2 0 -0.438860
3 0 -2.547989
4 0 0.587108
5 0 0.398858
6 1 0.776177
7 1 -0.391724
8 1 -0.886805
9 1 1.989000
"""
Try:
most_common = df.availability.value_counts().idxmax()  # compute the most frequent value once, not per row
df.availability.apply(lambda x: 1 if x == most_common else 0)
You can use Series.mode() to get the most frequent value(s) and isin to check whether each value in the column is among them:
df['col'] = df['availability'].isin(df['availability'].mode()).astype(int)
You can compare to the mode with isin, then convert the boolean to integer (True -> 1, False -> 0):
df['col2'] = df['col'].isin(df['col'].mode()).astype(int)
Example (here, 2 and 4 are tied as the most frequent values), shown as a new column "col2" for clarity:
col col2
0 0 0
1 2 1
2 2 1
3 2 1
4 4 1
5 4 1
6 4 1
7 1 0
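For completeness, a runnable reconstruction of that example (the input column is inferred from the output above):
import pandas as pd

df = pd.DataFrame({'col': [0, 2, 2, 2, 4, 4, 4, 1]})
df['col2'] = df['col'].isin(df['col'].mode()).astype(int)   # mode() is [2, 4] here
print(df)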

Apply a value across a group of a Pandas Data Frame

I am trying to summarize values across each group where the types match and apply that to the row where store=1.
The example below for group A contains one store=1 row and three store=2 rows.
I would like to roll up all type 3's in group A to the store=1 row.
Sample data:
data = {'group': ['A','A','A','A','B','B','B','B'],
        'store': ['1','2','2','2','1','2','2','2'],
        'type':  ['3','3','1','1','5','0','5','5'],
        'num':   ['10','20','30','40','50','60','70','80']}
t1 = pd.DataFrame(data)
group store type num
A 1 3 10
A 2 3 20
A 2 1 30
A 2 1 40
B 1 5 50
B 2 0 60
B 2 5 70
B 2 5 80
and the correct output should be a new column ('new_num') containing a list at the store=1 row for each group where the types match.
group store type num new_num
A 1 3 10 ['10','20']
A 2 3 20 []
A 2 1 30 []
A 2 1 40 []
B 1 5 50 ['50','70','80']
B 2 0 60 []
B 2 5 70 []
B 2 5 80 []
IIUC
t1['new_num'] = [[] for x in range(len(t1))]
t1.loc[t1.store=='1', 'new_num'] = [
    y.loc[y.type.isin(y.loc[y.store=='1', 'type']), 'num'].tolist()
    for x, y in t1.groupby('group', sort=False)
]
t1
Out[369]:
group store type num new_num
0 A 1 3 10 [10, 20]
1 A 2 3 20 []
2 A 2 1 30 []
3 A 2 1 40 []
4 B 1 5 50 [50, 70, 80]
5 B 2 0 60 []
6 B 2 5 70 []
7 B 2 5 80 []
Setup
ncol = [[] for _ in range(t1.shape[0])]
res = t1.set_index('group').assign(new_num=ncol)
1) Using some wonky string concats and groupby's
u = t1.group + t1.type
check = u[t1.store.eq('1')]
m = t1.loc[u.isin(check)].groupby('group')['num'].agg(list)
res.loc[res.store.eq('1'), 'new_num'] = m
2) If you'd like to stray even further from the light, use an abomination of a pivot
f = t1.pivot_table(
index=['group', 'type'],
columns='store',
values='num',
aggfunc=list
).reset_index()
m = f[f['1'].notnull()].set_index('group').drop(columns='type').sum(axis=1)
res.loc[res.store.eq('1'), 'new_num'] = m
Both somehow manage to produce:
store type num new_num
group
A 1 3 10 [10, 20]
A 2 3 20 []
A 2 1 30 []
A 2 1 40 []
B 1 5 50 [50, 70, 80]
B 2 0 60 []
B 2 5 70 []
B 2 5 80 []
While a terrible use of pivot, I actually think that solution is pretty neat:
store group type 1 2
0 A 1 NaN [30, 40]
1 A 3 [10] [20]
2 B 0 NaN [60]
3 B 5 [50] [70, 80]
It produces the above aggregation, in which the rows with non-null values in column '1' are exactly the matching group-type combinations you are after; summing across those rows gives you the aggregated list you need.
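If you'd rather avoid both the string concats and the pivot, the same matching logic reads fairly plainly as a per-group function; a sketch under the same setup (t1 as above, new_num pre-filled with empty lists):
def store1_matches(g):
    # types that appear on this group's store == '1' rows
    store1_types = set(g.loc[g['store'] == '1', 'type'])
    # collect the nums of every row whose type is among them
    return g.loc[g['type'].isin(store1_types), 'num'].tolist()

m = t1.groupby('group', sort=False).apply(store1_matches)
t1['new_num'] = [[] for _ in range(len(t1))]
t1.loc[t1['store'] == '1', 'new_num'] = m.tolist()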

How to pass a value from one row to the next one in pandas + python and use it to calculate the same following value recursively

I am trying to calculate the columns df[Value] and df[Value_Compensed]. However, to do that, I need to consider the previous row's value of df[Value_Compensed]. In terms of my table:
In the first row, all the values are 0.
In the following rows: df[Remained] = previous df[Value_Compensed]. Then df[Value] = df[Initial_value] + df[Remained]. Then df[Value_Compensed] = df[Value] - df[Compensation].
...and so on...
I am struggling to pass the value of Value_Compensed from one row to the next. I tried the shift() function, but the resulting values in df[Value_Compensed] were not correct: the value is not static, it changes after each row, so that did not work. Any ideas?
Thanks.
Manuel.
You can use apply to create your customised operations. I've made a dummy dataset as you didn't provide the initial dataframe.
from itertools import zip_longest
import numpy as np
import pandas as pd

# dummy data
df = pd.DataFrame(np.random.randint(1, 10, (8, 5)),
                  columns=['compensation', 'initial_value',
                           'remained', 'value', 'value_compensed'])
df.loc[0] = 0, 0, 0, 0, 0
>>> print(df)
compensation initial_value remained value value_compensed
0 0 0 0 0 0
1 2 9 1 9 7
2 1 4 9 8 3
3 3 4 5 7 6
4 3 2 5 5 6
5 9 1 5 2 4
6 4 5 9 8 2
7 1 6 9 6 8
Use apply with axis=1 to do row-wise iteration, passing the initial dataframe as an argument; from it you can then get the previous row via x.name - 1 and do your calculations. I'm not sure I fully understood the intended result, but you can adjust the individual calculations of the different columns in the function.
def f(x, data):
    if x.name == 0:
        return [0,]*data.shape[1]
    else:
        x_remained = data.loc[x.name-1]['value_compensed']
        x_value = data.loc[x.name-1]['initial_value'] + x_remained
        x_compensed = x_value - x['compensation']
        return [x['compensation'], x['initial_value'], x_remained,
                x_value, x_compensed]

adj = df.apply(f, args=(df,), axis=1)
adj = pd.DataFrame.from_records(zip_longest(*adj.values), index=df.columns).T
>>> print(adj)
compensation initial_value remained value value_compensed
0 0 0 0 0 0
1 5 9 0 0 -5
2 5 7 4 13 8
3 7 9 1 8 1
4 6 6 5 14 8
5 4 9 6 12 8
6 2 4 2 11 9
7 9 2 6 10 1
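One caveat: f reads value_compensed from the original frame, so nothing computed actually propagates from row to row. If the dependence is genuinely recursive, as described in the question (each Remained is the previous computed Value_Compensed), a plain loop that carries the running value is the simplest correct route. A sketch, assuming only Initial_value and Compensation are given (the input numbers here are made up):
import pandas as pd

# hypothetical inputs; only these two columns are assumed known
df = pd.DataFrame({'initial_value': [0, 9, 4, 4],
                   'compensation':  [0, 2, 1, 3]})

remained, value, compensed = [], [], []
prev_compensed = 0                      # first row: everything starts at 0
for init, comp in zip(df['initial_value'], df['compensation']):
    r = prev_compensed                  # Remained = previous Value_Compensed
    v = init + r                        # Value = Initial_value + Remained
    c = v - comp                        # Value_Compensed = Value - Compensation
    remained.append(r)
    value.append(v)
    compensed.append(c)
    prev_compensed = c

df['remained'] = remained
df['value'] = value
df['value_compensed'] = compensed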

Looping through a groupby and adding a new column

I need to write a small script to get through some data (around 50k rows/file) and my original file looks like this:
Label ID TRACK_ID QUALITY POSITION_X POSITION_Y POSITION_Z POSITION_T FRAME RADIUS VISIBILITY MANUAL_COLOR MEAN_INTENSITY MEDIAN_INTENSITY MIN_INTENSITY MAX_INTENSITY TOTAL_INTENSITY STANDARD_DEVIATION ESTIMATED_DIAMETER CONTRAST SNR
ID1119 1119 9 6.672 384.195 122.923 0 0 0 5 1 -10921639 81.495 0 0 255 7905 119.529 5.201 1 0.682
ID2237 2237 9 7.078 381.019 122.019 0 1 1 5 1 -10921639 89.381 0 0 255 8670 122.301 5.357 1 0.731
ID2512 2512 9 7.193 377.739 120.125 0 2 2 5 1 -10921639 92.01 0 0 255 8925 123.097 5.356 1 0.747
(...)
ID1102 1102 18 4.991 808.857 59.966 0 0 0 5 1 -10921639 52.577 0 0 255 5100 103.7 4.798 1 0.507
(...)
It's a rather big table with up to 50k rows. Not all the data is important to me; I mainly need the TRACK_ID and the X and Y positions.
So I create a dataframe from the Excel file and only access the corresponding columns:
df = pd.read_excel('.../sample.xlsx', 'Sheet1', parse_cols="D,F,G")
And this works as expected. Each TRACK_ID is basically one set of data that needs to be analyzed, so the straightforward way is to group the dataframe by TRACK_ID:
Grouping = df.groupby("TRACK_ID")
Also works as intended. Now I need to grab the first POSITION_X value of each group and subtract it from the other POSITION_X values in that group.
Now, I already read that looping is probably not the best way to go about it, but I have no idea how else to do it.
for name, group in Grouping:
    first_X = group.iloc[0, 1]
    vect = group.iloc[1:, 1] - first_X
This stores the values in vect, which, if I print it out, gives me the correct values. However, I do not know how to add them to a new column.
Maybe someone could guide me into the correct direction. Thanks in advance.
EDIT
This was suggested by chappers:
def f(grouped):
    grouped.iloc[1:] = 0
    return grouped

grouped = df.groupby('TRACK_ID')
df['Calc'] = grouped['POSITION_X'].apply(lambda x: x - x.iloc[0]) + grouped['POSITION_X'].apply(f)

for name, group in grouped:
    print(name)
    print(group)
Input:
TRACK_ID POSITION_X POSITION_Y
0 9 384.195 122.923
1 9 381.019 122.019
2 9 377.739 120.125
3 9 375.211 117.224
4 9 373.213 113.938
5 9 371.625 110.161
6 9 369.803 106.424
7 9 367.717 103.239
8 18 808.857 59.966
9 18 807.715 61.032
10 18 808.165 63.133
11 18 810.147 64.853
12 18 812.084 65.084
13 18 812.880 63.683
14 18 812.083 62.203
15 18 810.041 61.188
16 18 808.568 62.260
Output for group == 9
TRACK_ID POSITION_X POSITION_Y Calc
0 9 384.195 122.923 384.195
1 9 381.019 122.019 -3.176
2 9 377.739 120.125 -6.456
3 9 375.211 117.224 -8.984
4 9 373.213 113.938 -10.982
5 9 371.625 110.161 -12.570
6 9 369.803 106.424 -14.392
7 9 367.717 103.239 -16.478
So the expected output would be that the very first Calc value of every group is 0.
Here is one way of approaching it, using the apply method to subtract the first item from all the other observations.
df = pd.DataFrame({'A' : ['foo', 'foo', 'foo', 'foo',
                          'bar', 'bar', 'bar', 'bar'],
                   'C' : [1,2,3,4,4,3,2,1]})
grouped = df.groupby('A')
df['C1'] = grouped['C'].apply(lambda x: x - x.iloc[0])
This would have input:
A C
0 foo 1
1 foo 2
2 foo 3
3 foo 4
4 bar 4
5 bar 3
6 bar 2
7 bar 1
and output
A C C1
0 foo 1 0
1 foo 2 1
2 foo 3 2
3 foo 4 3
4 bar 4 0
5 bar 3 -1
6 bar 2 -2
7 bar 1 -3
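A variant without apply: groupby.transform('first') broadcasts each group's first value back onto that group's rows, so the subtraction becomes a plain vectorized operation and there are no index-alignment concerns. Using the toy frame above:
grouped = df.groupby('A')
df['C1'] = df['C'] - grouped['C'].transform('first')
This produces the same C1 column as above.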

python dataframe indexed by a list

I am trying to take a DataFrame column that contains repeating values from a finite set and substitute these values by index numbers, so if the values are [200,20,1000,1] the indexes of their occurrences will be [1,2,3,4].
Actual data example is:
0 aaa
1 aaa
2 bbb
3 aaa
4 bbb
5 bbb
6 ccc
7 ddd
8 ccc
9 ddd
The desired output is
0 1
1 1
2 2
3 1
4 2
5 2
6 4
7 3
8 4
9 3
I want to change the values, which make little sense on their own, to numbers. That's all... I do not care about the order of the indexing, i.e. 1 could be 3 and so on, as long as the indexing is consistent. I.e., I don't care whether ['aaa','bbb','ccc','ddd'] is indexed by [1,2,3,4] or [2,4,3,1].
Suppose that the DF name is tbl and I want to change only a subset of indexes in column 'aaa'. Let's denote these indexes by tbl_ind. The way I want to do that is:
tmp_r = tbl[tbl_ind]
un_r_ind = np.unique(tmp_r)
for r_ind in range(len(un_r_ind)):
    r_ind_ind = np.array(np.where(tmp_r == un_r_ind[r_ind])[0])
    for j_ind in range(len(r_ind_ind)):
        tbl['aaa'].iloc[tbl_ind[r_ind_ind[j_ind]]] = r_ind
It works, but it is REALLY slow on large data sets. Python does not let me update tbl['aaa'].iloc[tbl_ind[r_ind_ind]] in one go, as it's a list of indexes...
Help please? How is it possible to speed this up?
Many thanks!
I'd construct a dict of the values you want to replace and then call map:
In [7]: df
Out[7]:
  data
0  aaa
1  aaa
2  bbb
3  aaa
4  bbb
5  bbb
6  ccc
7  ddd
8  ccc
9  ddd
In [8]:
d = {'aaa':1,'bbb':2,'ccc':3,'ddd':4}
df['data'] = df['data'].map(d)
df
Out[8]:
   data
0     1
1     1
2     2
3     1
4     2
5     2
6     3
7     4
8     3
9     4
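If the set of values isn't known up front, the dict can be built from the column's unique values first; the exact numbers are arbitrary, which the question allows. A sketch:
d = {v: i + 1 for i, v in enumerate(df['data'].unique())}
df['data'] = df['data'].map(d)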
You could use rank with the dense method:
>>> df[0].rank(method="dense")
0 1
1 1
2 2
3 1
4 2
5 2
6 3
7 4
8 3
9 4
Name: 0, dtype: float64
This basically sorts the values and maps the lowest to 1, the second-lowest to 2, and so on.
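Since any consistent numbering is acceptable here, pd.factorize is another route; it numbers values by order of first appearance, 0-based. A sketch:
>>> codes, uniques = pd.factorize(df[0])
>>> df[0] = codes + 1   # +1 to start at 1; gives [1, 1, 2, 1, 2, 2, 3, 4, 3, 4]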
I am not sure I have understood correctly from your example.
Is this what you are trying to achieve (apart from the offset in the index: zero-based instead of one-based)?
df = ['aaa','aaa','bbb','aaa','bbb','bbb','ccc','ddd','ccc','ddd']
idx = {}

def index_data(v):
    global idx
    if v in idx:
        return idx[v]
    else:
        n = len(idx)
        idx[v] = n
        return n

if __name__ == "__main__":
    outlist = []
    for i in df:
        outlist.append(index_data(i))
    for i, v in enumerate(outlist):
        print(i, v)
It outputs:
0 0
1 0
2 1
3 0
4 1
5 1
6 2
7 3
8 2
9 3
Obviously it can be optimised (e.g. simply incrementing a counter for n instead of checking the size of the index)
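For instance, the lookup-or-assign step can be folded into a defaultdict whose factory hands out the next number (a sketch):
from collections import defaultdict
from itertools import count

counter = count()
idx = defaultdict(lambda: next(counter))   # missing keys get the next number automatically
outlist = [idx[v] for v in df]             # [0, 0, 1, 0, 1, 1, 2, 3, 2, 3]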
