Looping through a groupby and adding a new column - python

I need to write a small script to get through some data (around 50k rows/file) and my original file looks like this:
Label ID TRACK_ID QUALITY POSITION_X POSITION_Y POSITION_Z POSITION_T FRAME RADIUS VISIBILITY MANUAL_COLOR MEAN_INTENSITY MEDIAN_INTENSITY MIN_INTENSITY MAX_INTENSITY TOTAL_INTENSITY STANDARD_DEVIATION ESTIMATED_DIAMETER CONTRAST SNR
ID1119 1119 9 6.672 384.195 122.923 0 0 0 5 1 -10921639 81.495 0 0 255 7905 119.529 5.201 1 0.682
ID2237 2237 9 7.078 381.019 122.019 0 1 1 5 1 -10921639 89.381 0 0 255 8670 122.301 5.357 1 0.731
ID2512 2512 9 7.193 377.739 120.125 0 2 2 5 1 -10921639 92.01 0 0 255 8925 123.097 5.356 1 0.747
(...)
ID1102 1102 18 4.991 808.857 59.966 0 0 0 5 1 -10921639 52.577 0 0 255 5100 103.7 4.798 1 0.507
(...)
It's a rather big table with up to 50k rows. Not all of the data is important to me; I mainly need the TRACK_ID and the X and Y positions.
So I create a dataframe from the Excel file and only access the corresponding columns:
df = pd.read_excel('.../sample.xlsx', 'Sheet1', parse_cols="D,F,G")
And this works as expected. Each TRACK_ID is basically one set of data that needs to be analyzed, so the straightforward way is to group the dataframe by TRACK_ID:
Grouping = df.groupby("TRACK_ID")
Also works as intended. Now I need to grab the first POSITION_X value of each group and subtract it from the other POSITION_X values in that group.
Now, I already read that looping is probably not the best way to go about it, but I have no idea how else to do it.
for name, group in Grouping:
    first_X = group.iloc[0, 1]
    vect = group.iloc[1:, 1] - first_X
This stores the values in vect, and printing it shows the correct results. However, I do not know how to add them to a new column.
Maybe someone could point me in the right direction. Thanks in advance.
EDIT
This was suggested by chappers
def f(grouped):
    grouped.iloc[1:] = 0
    return grouped

grouped = df.groupby('TRACK_ID')
df['Calc'] = grouped['POSITION_X'].apply(lambda x: x - x.iloc[0])
grouped['POSITION_X'].apply(f)
for name, group in grouped:
    print(name)
    print(group)
Input:
TRACK_ID POSITION_X POSITION_Y
0 9 384.195 122.923
1 9 381.019 122.019
2 9 377.739 120.125
3 9 375.211 117.224
4 9 373.213 113.938
5 9 371.625 110.161
6 9 369.803 106.424
7 9 367.717 103.239
8 18 808.857 59.966
9 18 807.715 61.032
10 18 808.165 63.133
11 18 810.147 64.853
12 18 812.084 65.084
13 18 812.880 63.683
14 18 812.083 62.203
15 18 810.041 61.188
16 18 808.568 62.260
Output for group == 9
TRACK_ID POSITION_X POSITION_Y Calc
0 9 384.195 122.923 384.195
1 9 381.019 122.019 -3.176
2 9 377.739 120.125 -6.456
3 9 375.211 117.224 -8.984
4 9 373.213 113.938 -10.982
5 9 371.625 110.161 -12.570
6 9 369.803 106.424 -14.392
7 9 367.717 103.239 -16.478
So the expected output would be that the very first Calc value of every group is 0.
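For reference, a minimal sketch (not from the original thread) that keeps the loop but writes the result back through each group's index, so the first Calc value of every group comes out as 0:
# hedged sketch: write each group's differences back via its index
df['Calc'] = 0.0
for name, group in df.groupby('TRACK_ID'):
    df.loc[group.index, 'Calc'] = group['POSITION_X'] - group['POSITION_X'].iloc[0]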

Here is one way of approaching it, using the apply method to subtract the first item from all the other observations in the group.
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar'],
                   'C': [1, 2, 3, 4, 4, 3, 2, 1]})
grouped = df.groupby('A')
df['C1'] = grouped['C'].apply(lambda x: x - x.iloc[0])
This would have input:
A C
0 foo 1
1 foo 2
2 foo 3
3 foo 4
4 bar 4
5 bar 3
6 bar 2
7 bar 1
and output
A C C1
0 foo 1 0
1 foo 2 1
2 foo 3 2
3 foo 4 3
4 bar 4 0
5 bar 3 -1
6 bar 2 -2
7 bar 1 -3
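An equivalent and usually faster formulation subtracts the group-wise first value with transform; a minimal sketch against the same frame (an addition, not part of the original answer):
# vectorized: subtract each group's first value from the whole column
df['C1'] = df['C'] - df.groupby('A')['C'].transform('first')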

Related

Auto re-assign ids in a dataframe

I have the following dataframe:
import pandas as pd
data = {'id': [542588, 542594, 542594, 542605, 542605, 542605, 542630, 542630],
        'label': [3, 3, 1, 1, 2, 0, 0, 2]}
df = pd.DataFrame(data)
df
id label
0 542588 3
1 542594 3
2 542594 1
3 542605 1
4 542605 2
5 542605 0
6 542630 0
7 542630 2
The id column contains large 6-digit integers. I want a way to simplify them, starting from 10, so that 542588 becomes 10, 542594 becomes 11, etc.
Required output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
You can use factorize:
df['id'] = df['id'].factorize()[0] + 10
Output:
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
Note: factorize enumerates the keys in the order in which they occur in your data, while the groupby().ngroup() solution enumerates them in increasing order. You can mimic the increasing order with factorize by sorting the data first, or replicate the data order with groupby() by passing sort=False to it.
You can try
df['id'] = df.groupby('id').ngroup().add(10)
print(df)
id label
0 10 3
1 11 3
2 11 1
3 12 1
4 12 2
5 12 0
6 13 0
7 13 2
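To make the relationship between the two answers concrete, a small hedged sketch: with sort=False, ngroup numbers the groups in appearance order, matching factorize.
# appearance order: factorize matches groupby(sort=False).ngroup()
a = df['id'].factorize()[0] + 10
b = df.groupby('id', sort=False).ngroup() + 10
assert (a == b).all()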
This is a naive way of looping through the IDs: every time you encounter an ID you haven't seen before, associate it in a dictionary with a new ID (starting at 10 and incrementing by 1 each time).
You can then swap out the values of the ID column using the map method.
new_ids = dict()
new_id = 10
for old_id in df['id']:
    if old_id not in new_ids:
        new_ids[old_id] = new_id
        new_id += 1
df['id'] = df['id'].map(new_ids)

Given a value or constant, I need to only output relevant rows on Pandas

This is what my data looks like:
Day Price A Price B Price C
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 64503 43692 79982
6 86664 69990 53468
7 77924 62998 68911
8 66600 68830 94396
9 82664 89972 49614
10 59741 48904 49528
11 34030 98074 72993
12 74400 85547 37715
13 51031 50031 85345
14 74700 59932 73935
15 62290 98130 88818
I have a small Python script that outputs a sum for each column. I input an n value (the number of days) and the summing runs and outputs the values.
However, given for example n=5 (days), I want to output only the Price A/B/C rows starting from the next day (which is day 6); hence, the row for Day 5 should be '0'.
How can I express this logic in pandas?
The idea I have is to use the n input value to truncate the values on the rows corresponding to that n day value, but how can I do this in code?
if dataframe['Day'] == n:
    dataframe['Price A'] == 0 & dataframe['Price B'] == 0 & dataframe['Price C'] == 0
You can filter rows by a condition and set all columns except the first with iloc[mask, 1:]; to move the condition to the next row, add Series.shift:
n = 5
df.iloc[(df['Day'].shift() <= n).values, 1:] = 0
print (df)
Day Price A Price B Price C
0 1 0 0 0
1 2 0 0 0
2 3 0 0 0
3 4 0 0 0
4 5 0 0 0
5 6 0 0 0
6 7 77924 62998 68911
7 8 66600 68830 94396
8 9 82664 89972 49614
9 10 59741 48904 49528
10 11 34030 98074 72993
11 12 74400 85547 37715
12 13 51031 50031 85345
13 14 74700 59932 73935
14 15 62290 98130 88818
Pseudo code:
1. Make sure to sort by day.
2. Shift columns 'A', 'B' and 'C' by n and fill in with 0.
3. Sum accordingly.
All of that can be done in one line as well; a sketch follows.
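One hedged reading of that recipe (the price column names are taken from the question; the zero-filling step is done here with a boolean mask rather than a literal shift, since shifting would also displace the values being summed):
n = 5
price_cols = ['Price A', 'Price B', 'Price C']
df = df.sort_values('Day')
# zero out prices up to day n, then sum what remains
df.loc[df['Day'] <= n, price_cols] = 0
print(df[price_cols].sum())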
It is simply:
dataframe.iloc[:n+1] = 0
This sets the values of all columns in the first n+1 rows (positions 0 through n) to 0.
# Sample run
dataframe
a b
0 1 2
1 2 3
2 3 4
3 4 2
4 5 3
n = 1
dataframe.iloc[:n+1] = 0
dataframe
a b
0 0 0
1 0 0
2 3 4
3 4 2
4 5 3
This truncates all of the previous days as well. If you want to truncate only the nth day:
dataframe.iloc[n] = 0

How to pass a value from one row to the next one in pandas + python and use it to calculate the same following value recursively

This is my desired output:
I am trying to calculate the columns df['Value'] and df['Value_Compensed']. However, to do that, I need to consider the previous row's value of df['Value_Compensed']. In terms of my table:
The first row all the values are 0
The following rows: df['Remained'] = the previous row's df['Value_Compensed']. Then df['Value'] = df['Initial_value'] + df['Remained']. Then df['Value_Compensed'] = df['Value'] - df['Compensation']
...And So on...
I am struggling to pass the value of Value_Compensed from one row to the next. I tried the shift() function, but as you can see in the image, the values in df['Value_Compensed'] are not correct: the value is not static, and it changes after each row, so it did not work. Any ideas?
Thanks.
Manuel.
You can use apply to create your customised operations. I've made a dummy dataset as you didn't provide the initial dataframe.
import numpy as np
import pandas as pd
from itertools import zip_longest

# dummy data
df = pd.DataFrame(np.random.randint(1, 10, (8, 5)),
                  columns=['compensation', 'initial_value',
                           'remained', 'value', 'value_compensed'])
df.loc[0] = 0, 0, 0, 0, 0
>>> print(df)
compensation initial_value remained value value_compensed
0 0 0 0 0 0
1 2 9 1 9 7
2 1 4 9 8 3
3 3 4 5 7 6
4 3 2 5 5 6
5 9 1 5 2 4
6 4 5 9 8 2
7 1 6 9 6 8
Use apply with axis=1 to do row-wise iteration, passing the initial dataframe as an argument; from it you can get the previous row via x.name - 1 and do your calculations. I am not sure I fully understood the intended result, but you can adjust the individual column calculations in the function.
def f(x, data):
    if x.name == 0:
        return [0] * data.shape[1]
    else:
        x_remained = data.loc[x.name - 1]['value_compensed']
        x_value = data.loc[x.name - 1]['initial_value'] + x_remained
        x_compensed = x_value - x['compensation']
        return [x['compensation'], x['initial_value'], x_remained,
                x_value, x_compensed]

adj = df.apply(f, args=(df,), axis=1)
adj = pd.DataFrame.from_records(zip_longest(*adj.values), index=df.columns).T
>>> print(adj)
compensation initial_value remained value value_compensed
0 0 0 0 0 0
1 5 9 0 0 -5
2 5 7 4 13 8
3 7 9 1 8 1
4 6 6 5 14 8
5 4 9 6 12 8
6 2 4 2 11 9
7 9 2 6 10 1
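Since each row depends on the computed value of the previous row, a plain Python loop is another workable option; a hedged sketch (the input values here are made up, and the column names follow the question):
import pandas as pd

# hypothetical input: only Initial_value and Compensation are given
df = pd.DataFrame({'Initial_value': [0, 9, 4, 4, 2],
                   'Compensation':  [0, 2, 1, 3, 3]})

rows = [(0, 0, 0)]  # the first row is all zeros
value_compensed = 0
for i in range(1, len(df)):
    remained = value_compensed                      # previous Value_Compensed
    value = df.loc[i, 'Initial_value'] + remained
    value_compensed = value - df.loc[i, 'Compensation']
    rows.append((remained, value, value_compensed))

df = df.join(pd.DataFrame(rows, columns=['Remained', 'Value', 'Value_Compensed']))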

Replace everything that starts with number python

I am working with ICD-9 codes for a data mining project using Python, and I am having trouble converting the specific codes into categories. For example, I am trying to replace everything between 001 and 139 with 0, everything between 140 and 239 with 1, etc.
This is what I have tried:
df = df.replace({'diag_1' : {'(1-139)' : 0, '(140-239)' : 1}})
You can use pd.cut to achieve this:
In [175]:
df = pd.DataFrame({'value':np.random.randint(0,20,10)})
df
Out[175]:
value
0 12
1 2
2 10
3 5
4 19
5 2
6 8
7 14
8 12
9 16
Here we set bin intervals of (0, 5], (5, 15], (15, 20]:
In [183]:
df['new_value'] = pd.cut(df['value'], bins=[0,5,15,20], labels=[0,1,2])
df
Out[183]:
value new_value
0 12 1
1 2 0
2 10 1
3 5 0
4 19 2
5 2 0
6 8 1
7 14 1
8 12 1
9 16 2
I think in your case the following should work (note that the number of labels must match the number of bins, i.e. len(bins) - 1):
df['diag_1'] = pd.cut(df['diag_1'], bins=[0, 139, 239], labels=[0, 1])
You can set the bins and labels dynamically using np.arange or similar.
There is nothing wrong with an if-statement.
newvalue = 0 if oldvalue <= 139 else 1
Apply this function as a lambda expression with map.
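For instance, a hedged one-liner along those lines (extend the conditional chain for the remaining ICD-9 ranges):
df['diag_1'] = df['diag_1'].map(lambda v: 0 if v <= 139 else 1 if v <= 239 else 2)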

Keeping subset of row labels in pandas DataFrame based on second index

Given a DataFrame with a hierarchical index containing three levels (experiment, trial, slot) and a second DataFrame with a hierarchical index containing two levels (experiment, trial), how do I drop all the rows in the first DataFrame whose (experiment, trial) are not contained in the second dataframe?
Example data:
from io import StringIO
import pandas as pd
df1_data = StringIO(u',experiment,trial,slot,token\n0,btn144a10_p_RDT,0,0,4.0\n1,btn144a10_p_RDT,0,1,14.0\n2,btn144a10_p_RDT,1,0,12.0\n3,btn144a10_p_RDT,1,1,14.0\n4,btn145a07_p_RDT,0,0,6.0\n5,btn145a07_p_RDT,0,1,19.0\n6,btn145a07_p_RDT,1,0,17.0\n7,btn145a07_p_RDT,1,1,13.0\n8,chn004b06_p_RDT,0,0,6.0\n9,chn004b06_p_RDT,0,1,8.0\n10,chn004b06_p_RDT,1,0,2.0\n11,chn004b06_p_RDT,1,1,5.0\n12,chn008a06_p_RDT,0,0,12.0\n13,chn008a06_p_RDT,0,1,14.0\n14,chn008a06_p_RDT,1,0,6.0\n15,chn008a06_p_RDT,1,1,4.0\n16,chn008b06_p_RDT,0,0,3.0\n17,chn008b06_p_RDT,0,1,13.0\n18,chn008b06_p_RDT,1,0,12.0\n19,chn008b06_p_RDT,1,1,19.0\n20,chn008c04_p_RDT,0,0,17.0\n21,chn008c04_p_RDT,0,1,2.0\n22,chn008c04_p_RDT,1,0,1.0\n23,chn008c04_p_RDT,1,1,6.0\n')
df1 = pd.read_csv(df1_data, index_col=0).set_index(['experiment', 'trial', 'slot'])
df2_data = StringIO(u',experiment,trial,target\n0,btn145a07_p_RDT,1,13\n1,chn004b06_p_RDT,1,9\n2,chn008a06_p_RDT,0,15\n3,chn008a06_p_RDT,1,15\n4,chn008b06_p_RDT,1,1\n5,chn008c04_p_RDT,1,12\n')
df2 = pd.read_csv(df2_data, index_col=0).set_index(['experiment', 'trial'])
The first dataframe looks like:
token
experiment trial slot
btn144a10_p_RDT 0 0 4
1 14
1 0 12
1 14
btn145a07_p_RDT 0 0 6
1 19
1 0 17
1 13
chn004b06_p_RDT 0 0 6
1 8
1 0 2
1 5
chn008a06_p_RDT 0 0 12
1 14
1 0 6
1 4
chn008b06_p_RDT 0 0 3
1 13
1 0 12
1 19
chn008c04_p_RDT 0 0 17
1 2
1 0 1
1 6
The second dataframe looks like:
target
experiment trial
btn145a07_p_RDT 1 13
chn004b06_p_RDT 1 9
chn008a06_p_RDT 0 15
1 15
chn008b06_p_RDT 1 1
chn008c04_p_RDT 1 12
The result I want:
token
experiment trial slot
btn145a07_p_RDT 1 0 17
1 13
chn004b06_p_RDT 1 0 2
1 5
chn008a06_p_RDT 0 0 12
1 14
1 0 6
1 4
chn008b06_p_RDT 1 0 12
1 19
chn008c04_p_RDT 1 0 1
1 6
One way to do it would be by using merge:
merged = pd.merge(
    df2.reset_index(),
    df1.reset_index(),
    left_on=['experiment', 'trial'],
    right_on=['experiment', 'trial'],
    how='left')
You just need to reindex merged to whatever you like (I could not tell exactly from the question).
What should work is
df1.loc[df2.index]
but multi indexing still has some problems. What does work is
df1.reset_index(2).loc[df2.index].set_index('slot', append=True)
which is a bit of a hack around this problem. Note that
df1.loc[df2.index[:1]]
gives garbage while
df1.loc[df2.index[0]]
gives what you would expect. So passing multiple values from an m-level index to an n-level index where n > m > 1 doesn't work, though it should.
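Another option worth sketching (an addition, assuming a reasonably recent pandas): build a boolean mask by dropping the extra index level and testing membership in df2's index.
# keep rows whose (experiment, trial) pair appears in df2's index
mask = df1.index.droplevel('slot').isin(df2.index)
result = df1[mask]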
