Change repeating groups in a column to incremental groups - python

I have the following dataframe:
df = pd.DataFrame({'group_nr':[0,0,1,1,1,2,2,3,3,0,0,1,1,2,2,2,3,3]})
print(df)
group_nr
0 0
1 0
2 1
3 1
4 1
5 2
6 2
7 3
8 3
9 0
10 0
11 1
12 1
13 2
14 2
15 2
16 3
17 3
and would like to change from repeating group numbers to incremental group numbers:
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
I can't find a way of doing this without looping through the rows. Does someone have an idea how to implement this nicely?

You can check where each value differs from the previous one (using shift), then take a cumsum of the resulting boolean series to generate the groups:
df['incremental_group_nr'] = df.group_nr.ne(df.group_nr.shift()).cumsum().sub(1)
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
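
To make the mechanics visible, here is a minimal sketch (my own illustration, built on the same sample frame) of the intermediate boolean series: it is True exactly where a new run of group numbers starts, so its cumulative sum numbers the runs consecutively.

import pandas as pd

df = pd.DataFrame({'group_nr': [0, 0, 1, 1, 1, 2, 2, 3, 3,
                                0, 0, 1, 1, 2, 2, 2, 3, 3]})

# True wherever the value differs from the row above
# (the first row compares against NaN and is therefore True as well)
starts = df['group_nr'].ne(df['group_nr'].shift())

# cumulative count of run starts; subtract 1 so numbering begins at 0
df['incremental_group_nr'] = starts.cumsum() - 1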

Compare the values with the shifted ones (Series.shift) for inequality with Series.ne, then take the cumulative sum and subtract 1:
df['incremental_group_nr'] = df['group_nr'].ne(df['group_nr'].shift()).cumsum() - 1
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
Another idea is to backfill the first missing value created by the shift with bfill, which avoids the subtraction:
df['incremental_group_nr'] = df['group_nr'].ne(df['group_nr'].shift().bfill()).cumsum()
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7

Related

pandas restart cumsum every time the value is zero

I have a series that I want to cumsum, but the sum should start over every time I hit a 0, something like this:
    orig  wanted result
0      0              0
1      1              1
2      1              2
3      1              3
4      1              4
5      1              5
6      1              6
7      0              0
8      1              1
9      1              2
10     1              3
11     0              0
12     1              1
13     1              2
14     1              3
15     1              4
16     1              5
17     1              6
Any ideas? (pandas, pure Python, or otherwise)
Use df['orig'].eq(0).cumsum() to generate groups starting on each 0, then cumcount to get the increasing values:
df['result'] = df.groupby(df['orig'].eq(0).cumsum()).cumcount()
output:
orig wanted result result
0 0 0 0
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 1 5 5
6 1 6 6
7 0 0 0
8 1 1 1
9 1 2 2
10 1 3 3
11 0 0 0
12 1 1 1
13 1 2 2
14 1 3 3
15 1 4 4
16 1 5 5
17 1 6 6
Intermediate:
df['orig'].eq(0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
10 2
11 3
12 3
13 3
14 3
15 3
16 3
17 3
Name: orig, dtype: int64
import pandas as pd

# each 0 starts a new group (the column in the question is named 'orig')
condition = df['orig'].eq(0)
df['reset'] = condition.cumsum()
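This only builds the group labels. One way to finish it, as a minimal sketch that restarts the cumulative sum of orig within each group (for the sample data this matches the cumcount result above):

# cumulative sum of 'orig', restarted at every 0
df['result'] = df.groupby(df['reset'])['orig'].cumsum()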

How to get a uniqueId for the following colum in pandas dataframe?

I have the following column in my pandas dataframe named as FailureLabel
ID FailureLabel
0 1 1
1 2 1
2 3 1
3 4 0
4 5 0
5 6 0
6 7 0
7 8 1
8 9 1
9 10 0
10 11 0
11 12 1
12 13 1
I would like to assign a unique_id to this column such that each 1 gets its own unique id, whereas each run of zeros plus the next 1 share a common unique id.
I tried using the following code,
df['unique_id'] = (df['FailureLabel'] | (df['FailureLabel']!=df['FailureLabel'].shift())).cumsum()
which gives me the following output,
ID FailureLabel unique_id
0 1 1 1
1 2 1 2
2 3 1 3
3 4 0 4
4 5 0 4
5 6 0 4
6 7 0 4
7 8 1 5
8 9 1 6
9 10 0 7
10 11 0 7
11 12 1 8
12 13 1 9
But what I desire is,
ID FailureLabel unique_id
0 1 1 1
1 2 1 2
2 3 1 3
3 4 0 4
4 5 0 4
5 6 0 4
6 7 0 4
7 8 1 4
8 9 1 5
9 10 0 6
10 11 0 6
11 12 1 6
12 13 1 7
Use Series.shift with backfilling of the first value, compare to 1, and take the cumulative sum:
df['unique_id'] = df['FailureLabel'].shift().bfill().eq(1).cumsum()
print (df)
ID FailureLabel unique_id
0 1 1 1
1 2 1 2
2 3 1 3
3 4 0 4
4 5 0 4
5 6 0 4
6 7 0 4
7 8 1 4
8 9 1 5
9 10 0 6
10 11 0 6
11 12 1 6
12 13 1 7
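
A minimal sketch (my own breakdown, assuming the same FailureLabel column) of the intermediates: the shift pushes each label down one row, bfill fills the resulting NaN in the first row, and the comparison to 1 marks the rows where the id should advance.

import pandas as pd

df = pd.DataFrame({'FailureLabel': [1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1]})

shifted = df['FailureLabel'].shift().bfill()  # previous row's label, first row backfilled
step = shifted.eq(1)                          # True wherever the previous label was 1
df['unique_id'] = step.cumsum()               # 1 2 3 4 4 4 4 4 5 6 6 6 7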

Assign 1 value to random sample of group where the sample size is equal to the value of another column

I want to randomly assign the value 1 to an IsShade column (output) such that, within each group, 1 can be assigned only D times (the Shading column, e.g. 2, 5 or 3 times) out of E rows (the Total column, e.g. 6, 8 or 5 rows).
The dataset has around 1 million rows; a sample of the input is shown below.
Input:
In[1]:
Sr Series Parallel Shading Total Cell
0 0 3 2 2 6 1
1 1 3 2 2 6 2
2 2 3 2 2 6 3
3 3 3 2 2 6 4
4 4 3 2 2 6 5
5 5 3 2 2 6 6
6 6 4 2 5 8 1
7 7 4 2 5 8 2
8 8 4 2 5 8 3
9 9 4 2 5 8 4
10 10 4 2 5 8 5
11 11 4 2 5 8 6
12 12 4 2 5 8 7
13 13 4 2 5 8 8
14 14 5 1 3 5 1
15 15 5 1 3 5 2
16 16 5 1 3 5 3
17 17 5 1 3 5 4
18 18 5 1 3 5 5
Any help on how to achieve this, or Python code that does it, would be appreciated. Thank you.
Example Expected Output:
Out[1]:
Sr Series Parallel Shading Total Cell IsShade
0 0 3 2 2 6 1 0
1 1 3 2 2 6 2 0
2 2 3 2 2 6 3 1
3 3 3 2 2 6 4 0
4 4 3 2 2 6 5 0
5 5 3 2 2 6 6 1
6 6 4 2 5 8 1 1
7 7 4 2 5 8 2 0
8 8 4 2 5 8 3 1
9 9 4 2 5 8 4 1
10 10 4 2 5 8 5 0
11 11 4 2 5 8 6 0
12 12 4 2 5 8 7 1
13 13 4 2 5 8 8 1
14 14 5 1 3 5 1 0
15 15 5 1 3 5 2 1
16 16 5 1 3 5 3 0
17 17 5 1 3 5 4 1
18 18 5 1 3 5 5 1
You can create a new column by doing a .groupby and randomly selecting, with .sample, as many rows per group as the integer in the Shading column. From there, I returned True or False and converted to an integer (True becomes 1 and False becomes 0 with .astype(int)):
s = df['Series'].ne(df['Series'].shift()).cumsum() #s is a unique identifier group
df['IsShade'] = (df.groupby(s, group_keys=False)
.apply(lambda x: x['Shading'].sample(x['Shading'].iloc[0])) > 0)
df['IsShade'] = df['IsShade'].fillna(False).astype(int)
df
Out[1]:
Sr Series Parallel Shading Total Cell IsShade
0 0 3 2 2 6 1 0
1 1 3 2 2 6 2 0
2 2 3 2 2 6 3 0
3 3 3 2 2 6 4 0
4 4 3 2 2 6 5 1
5 5 3 2 2 6 6 1
6 6 4 2 5 8 1 1
7 7 4 2 5 8 2 1
8 8 4 2 5 8 3 0
9 9 4 2 5 8 4 0
10 10 4 2 5 8 5 1
11 11 4 2 5 8 6 1
12 12 4 2 5 8 7 1
13 13 4 2 5 8 8 0
14 14 5 1 3 5 1 1
15 15 5 1 3 5 2 0
16 16 5 1 3 5 3 0
17 17 5 1 3 5 4 1
18 18 5 1 3 5 5 1
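
A side note that is not part of the original answer: Series.sample accepts a random_state parameter, so the random selection can be made reproducible. The same code with an arbitrary fixed seed (42 is just an example):

s = df['Series'].ne(df['Series'].shift()).cumsum()  # consecutive-group identifier
df['IsShade'] = (df.groupby(s, group_keys=False)
                   .apply(lambda x: x['Shading'].sample(x['Shading'].iloc[0],
                                                        random_state=42)) > 0)
df['IsShade'] = df['IsShade'].fillna(False).astype(int)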

Stacking Pandas Dataframe without dropping row

Currently, I have a dataframe like this:
0 0 0 3 0 0
0 7 8 9 1 0
0 4 5 2 4 0
My code to stack it:
dt = dataset.iloc[:,0:7].stack().sort_index(level=1).reset_index(level=0, drop=True).to_frame()
dt['variable'] = pandas.Categorical(dt.index).codes+1
dt.rename(columns={0:index_column_name}, inplace=True)
dt.set_index(index_column_name, inplace=True)
dt['variable'] = numpy.sort(dt['variable'])
However, it drops the first row when I stack it, and I want to keep the headers / first row. How would I achieve this?
In essence, I'm losing the data from the first row (a.k.a column headers) and I want to keep it.
Desired Output:
value,variable
0 1
0 1
0 1
0 2
7 2
4 2
0 3
8 3
5 3
3 4
9 4
2 4
0 5
1 5
4 5
0 6
0 6
0 6
Current output:
value,variable
0 1
0 1
7 2
4 2
8 3
5 3
9 4
2 4
1 5
4 5
0 6
0 6
Why not use df.melt as @WeNYoBen mentioned?
print(df)
1 2 3 4 5 6
0 0 0 0 3 0 0
1 0 7 8 9 1 0
2 0 4 5 2 4 0
print(df.melt())
variable value
0 1 0
1 1 0
2 1 0
3 2 0
4 2 7
5 2 4
6 3 0
7 3 8
8 3 5
9 4 3
10 4 9
11 4 2
12 5 0
13 5 1
14 5 4
15 6 0
16 6 0
17 6 0
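
If the exact layout of the desired output matters (value first, then variable), a small follow-up sketch on the same melted frame:

out = df.melt()[['value', 'variable']]  # reorder the melted columns
print(out)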

Separating elements of a Pandas DataFrame in Python

I have a pandas DataFrame that looks like the following:
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 0 2
6 1 3
7 2 4
8 3 5
9 4 6
10 0 3
11 1 4
12 2 5
13 3 6
14 4 7
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
25 0 3
26 1 4
27 2 5
28 3 6
29 4 7
which can be generated with the following code:
import pandas
time=[0,1,2,3,4]
repeat_1_conc_1=[1,2,3,4,5]
repeat_1_conc_2=[2,3,4,5,6]
repeat_1_conc_3=[3,4,5,6,7]
d1=pandas.DataFrame([time,repeat_1_conc_1]).transpose()
d2=pandas.DataFrame([time,repeat_1_conc_2]).transpose()
d3=pandas.DataFrame([time,repeat_1_conc_3]).transpose()
repeat_2_conc_1=[1,2,3,4,5]
repeat_2_conc_2=[2,3,4,5,6]
repeat_2_conc_3=[3,4,5,6,7]
d4=pandas.DataFrame([time,repeat_2_conc_1]).transpose()
d5=pandas.DataFrame([time,repeat_2_conc_2]).transpose()
d6=pandas.DataFrame([time,repeat_2_conc_3]).transpose()
df= pandas.concat([d1,d2,d3,d4,d5,d6]).reset_index()
df.drop('index',axis=1,inplace=True)
df.columns=['Time','Measurement']
print(df)
If you look at the code, you'll see that I have two experimental repeats in the same DataFrame, which should be separated at df.iloc[:15]. Additionally, within each experiment I have 3 sub-experiments that can be thought of as the starting conditions of a dose response, i.e. the first sub-experiment starts with 1, the second with 2 and the third with 3. These should be separated at index intervals of len(time) (time is 0-4, i.e. 5 elements per sub-experiment). Could somebody please tell me the best way to separate this data into individual time-course measurements for each experiment? I'm not exactly sure what the best data structure would be, but I just need to be able to access the data for each sub-experiment of each experimental repeat easily. Perhaps something like:
repeat1=
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 0 2
6 1 3
7 2 4
8 3 5
9 4 6
10 0 3
11 1 4
12 2 5
13 3 6
14 4 7
Repeat 2=
Time Measurement
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
25 0 3
26 1 4
27 2 5
28 3 6
29 4 7
IIUC, you may set a multiindex so that you can index your DF accessing experiments and subexperiments easily:
In [261]: dfi = df.set_index([df.index//15+1, df.index//5 - df.index//15*3 + 1])
In [262]: dfi
Out[262]:
Time Measurement
1 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
2 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
selecting subexperiments
In [263]: dfi.loc[1,1]
Out[263]:
Time Measurement
1 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
In [264]: dfi.loc[2,2]
Out[264]:
Time Measurement
2 2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
select second experiment with all subexperiments:
In [266]: dfi.loc[2,:]
Out[266]:
Time Measurement
1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
alternatively you can create your own slicing function:
def my_slice(rep=1, subexp=1):
    rep -= 1
    subexp -= 1
    # .ix is removed in modern pandas; .loc on the default integer index
    # slices inclusively, so the arithmetic below is unchanged
    return df.loc[rep*15 + subexp*5 : rep*15 + subexp*5 + 4, :]
demo:
In [174]: my_slice(1,1)
Out[174]:
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
In [175]: my_slice(2,1)
Out[175]:
Time Measurement
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
In [176]: my_slice(2,2)
Out[176]:
Time Measurement
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
PS: a slightly more convenient way to concatenate your DFs:
df = pandas.concat([d1,d2,d3,d4,d5,d6], ignore_index=True)
so you don't need the subsequent .reset_index() and .drop()
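
For readability it may also help to name the two index levels. A minimal sketch built on the same set_index idea, assuming df as constructed in the question (the level names repeat and subexp are illustrative, not from the original answer):

import pandas as pd

dfi = df.set_index([pd.Index(df.index // 15 + 1, name='repeat'),
                    pd.Index(df.index // 5 % 3 + 1, name='subexp')])

print(dfi.loc[(2, 2), :])          # repeat 2, sub-experiment 2
print(dfi.xs(1, level='repeat'))   # all sub-experiments of repeat 1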
