Pandas: add an incremental number based on another column - python

Consider a dataframe with a column like this:
sequence
1
2
3
4
5
1
2
3
1
2
3
4
5
6
7
I wish to create a column that increments each time the sequence resets. The runs are of variable length, so I'd get something like:
sequence run
1 1
2 1
3 1
4 1
5 1
1 2
2 2
3 2
1 3
2 3
3 3
4 3
5 3
6 3
7 3

Try diff followed by cumsum:
df['run'] = df['sequence'].diff().ne(1).cumsum()
Out[349]:
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 3
11 3
12 3
13 3
14 3
Name: sequence, dtype: int32
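To see why this works, here is a self-contained sketch (the DataFrame is rebuilt from the question's data) that walks through the intermediate steps:

```python
import pandas as pd

df = pd.DataFrame({'sequence': [1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7]})

# diff() is NaN on the first row and differs from 1 wherever the
# sequence restarts, so ne(1) flags every run start with True;
# cumsum() then turns those flags into consecutive run numbers.
df['run'] = df['sequence'].diff().ne(1).cumsum()
```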

If instead you want a running count of how many times each value has occurred so far, use groupby with cumcount:
dataset['run'] = dataset.groupby('sequence').cumcount().add(1)
output example:
sequence run
y 1
a 1
g 1
a 2
b 1
a 3
b 2
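A runnable sketch of this occurrence-count reading, using the letters from the output example above:

```python
import pandas as pd

df = pd.DataFrame({'sequence': list('yagabab')})

# cumcount() numbers each row within its group, starting at 0,
# so add(1) yields a 1-based occurrence counter per distinct value.
df['run'] = df.groupby('sequence').cumcount().add(1)
```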

add a new counter based on an existing counter in python pandas

I have a Series that looks like this:
col
0 1
1 2
2 3
3 4
4 1
5 2
6 3
7 1
8 2
9 3
10 1
11 2
and I would like to generate a second counter that looks like this
col col2
0 1 1
1 2 1
2 3 1
3 4 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
How can I do that in python?
If 1 always marks the start of a group, create a boolean mask with Series.eq and then take the cumulative sum with Series.cumsum:
df['col2'] = df['col'].eq(1).cumsum()
print (df)
col col2
0 1 1
1 2 1
2 3 1
3 4 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
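If groups do not always restart at exactly 1, a variant that treats any non-increase as a reset can be sketched like this (the Series is rebuilt from the question's data):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 1, 2], name='col')

# A new group starts wherever the value fails to increase; this also
# handles runs whose first element is not exactly 1.
col2 = s.diff().le(0).cumsum().add(1)
```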

Pandas DataFrame assign hierarchical number to element

I have the following Dataframe:
a b c d
0 1 4 9 2
1 2 5 8 7
2 4 6 2 3
3 3 2 7 5
I want to assign a number to each element in a row according to its order within the row. The result should look like this:
a b c d
0 1 3 4 2
1 1 2 4 3
2 3 4 1 2
3 2 1 4 3
I tried to use the np.argsort function, which doesn't work. Does someone know an easy way to do this? Thanks.
Use DataFrame.rank:
df = df.rank(axis=1).astype(int)
print (df)
a b c d
0 1 3 4 2
1 1 2 4 3
2 3 4 1 2
3 2 1 4 3
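One caveat worth noting: with duplicate values in a row, the default method='average' produces fractional ranks (e.g. 2.5) that astype(int) would silently truncate. A sketch with made-up data containing a tie, using method='first' to keep ranks integral:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [4, 2], 'c': [9, 8], 'd': [4, 7]})

# method='first' breaks ties by column order, so every rank stays
# a whole number and astype(int) is lossless.
ranked = df.rank(axis=1, method='first').astype(int)
```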

need to filter rows present in one dataframe on another

I have two pandas data frames, and I need the rows of the second whose values (across all columns) do not appear in the first.
For example:
df A
A B C D
6 4 1 6
7 6 6 3
1 6 2 9
8 0 4 9
1 0 2 3
8 4 7 5
4 7 1 1
3 7 3 4
5 2 8 8
3 2 8 8
5 2 8 8
df B
A B C D
1 0 2 3
8 4 7 5
4 7 1 1
1 0 2 3
8 4 7 5
4 7 1 1
3 7 3 4
5 2 8 8
1 1 1 1
2 2 2 2
1 1 1 1
required output:
A B C D
1 1 1 1
2 2 2 2
1 1 1 1
I tried pd.merge with inner/left joins on all columns, but it takes a lot of computation time and resources as the rows and columns grow. Is there another way, such as checking each row of dfA against dfB on all columns and picking the rows that exist only in dfB?
You can use merge with the indicator parameter:
df_b.merge(df_a, on=['A','B','C','D'],
           how='left', indicator='ind')\
    .query('ind == "left_only"')\
    .drop('ind', axis=1)
Output:
A B C D
9 1 1 1 1
10 2 2 2 2
11 1 1 1 1
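A self-contained sketch of this anti-join, using the default indicator=True (which names the column _merge) and small made-up frames:

```python
import pandas as pd

df_a = pd.DataFrame({'A': [1, 8], 'B': [0, 4], 'C': [2, 7], 'D': [3, 5]})
df_b = pd.DataFrame({'A': [1, 1, 2], 'B': [0, 1, 2],
                     'C': [2, 1, 2], 'D': [3, 1, 2]})

# A left merge from df_b tags each row 'both' (also present in df_a)
# or 'left_only' (present only in df_b); keep the latter.
only_in_b = (df_b.merge(df_a, on=['A', 'B', 'C', 'D'],
                        how='left', indicator=True)
                 .query('_merge == "left_only"')
                 .drop('_merge', axis=1))
```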

With python dataframes add column of counts of rows that meet condition to each row that meets it

Say I have a python DataFrame with the following structure:
pd.DataFrame([[1,2,3,4],[1,2,3,4],[1,3,5,6],[1,4,6,7],[1,4,6,7],[1,4,6,7]])
Out[262]:
0 1 2 3
0 1 2 3 4
1 1 2 3 4
2 1 3 5 6
3 1 4 6 7
4 1 4 6 7
5 1 4 6 7
How can I add a column called 'ct' that, for each row, counts how many rows have matching values in columns 1-3? The completed DataFrame would look like this:
0 1 2 3 ct
0 1 2 3 4 2
1 1 2 3 4 2
2 1 3 5 6 1
3 1 4 6 7 3
4 1 4 6 7 3
5 1 4 6 7 3
You can use groupby + transform + size:
df['ct'] = df.groupby([1,2,3])[1].transform('size')
#alternatively
#df['ct'] = df.groupby([1,2,3])[1].transform(len)
print (df)
0 1 2 3 ct
0 1 2 3 4 2
1 1 2 3 4 2
2 1 3 5 6 1
3 1 4 6 7 3
4 1 4 6 7 3
5 1 4 6 7 3
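A runnable sketch, rebuilding the question's frame; transform('size') computes each group's size once and broadcasts it back to every member row, unlike an aggregation, which would collapse the groups:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [1, 2, 3, 4], [1, 3, 5, 6],
                   [1, 4, 6, 7], [1, 4, 6, 7], [1, 4, 6, 7]])

# Group on columns 1-3 and broadcast each group's row count
# back onto every row belonging to that group.
df['ct'] = df.groupby([1, 2, 3])[1].transform('size')
```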

Separating elements of a Pandas DataFrame in Python

I have a pandas DataFrame that looks like the following:
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 0 2
6 1 3
7 2 4
8 3 5
9 4 6
10 0 3
11 1 4
12 2 5
13 3 6
14 4 7
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
25 0 3
26 1 4
27 2 5
28 3 6
29 4 7
which can be generated with the following code:
import pandas
time=[0,1,2,3,4]
repeat_1_conc_1=[1,2,3,4,5]
repeat_1_conc_2=[2,3,4,5,6]
repeat_1_conc_3=[3,4,5,6,7]
d1=pandas.DataFrame([time,repeat_1_conc_1]).transpose()
d2=pandas.DataFrame([time,repeat_1_conc_2]).transpose()
d3=pandas.DataFrame([time,repeat_1_conc_3]).transpose()
repeat_2_conc_1=[1,2,3,4,5]
repeat_2_conc_2=[2,3,4,5,6]
repeat_2_conc_3=[3,4,5,6,7]
d4=pandas.DataFrame([time,repeat_2_conc_1]).transpose()
d5=pandas.DataFrame([time,repeat_2_conc_2]).transpose()
d6=pandas.DataFrame([time,repeat_2_conc_3]).transpose()
df= pandas.concat([d1,d2,d3,d4,d5,d6]).reset_index()
df.drop('index',axis=1,inplace=True)
df.columns=['Time','Measurement']
print(df)
If you look at the code, you'll see that the DataFrame contains two experimental repeats, which should be separated at df.iloc[:15]. Within each repeat there are 3 sub-experiments, analogous to the starting conditions of a dose response: the first sub-experiment starts at 1, the second at 2, and the third at 3. These should be separated at index intervals of len(time) (0-4, i.e. 5 elements per sub-experiment). What is the best way to separate this data into individual time-course measurements for each experiment? I'm not sure what the best data structure would be; I just need easy access to the data for each sub-experiment of each experimental repeat. Perhaps something like:
repeat1 =
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 0 2
6 1 3
7 2 4
8 3 5
9 4 6
10 0 3
11 1 4
12 2 5
13 3 6
14 4 7
repeat2 =
Time Measurement
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
25 0 3
26 1 4
27 2 5
28 3 6
29 4 7
IIUC, you may set a MultiIndex so that you can index your DF, accessing experiments and sub-experiments easily:
In [261]: dfi = df.set_index([df.index//15+1, df.index//5 - df.index//15*3 + 1])
In [262]: dfi
Out[262]:
Time Measurement
1 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
2 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
selecting subexperiments
In [263]: dfi.loc[1,1]
Out[263]:
Time Measurement
1 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
In [264]: dfi.loc[2,2]
Out[264]:
Time Measurement
2 2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
select second experiment with all subexperiments:
In [266]: dfi.loc[2,:]
Out[266]:
Time Measurement
1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
alternatively you can create your own slicing function (using iloc, since .ix is deprecated; note that iloc's end bound is exclusive):
def my_slice(rep=1, subexp=1):
    # each repeat spans 15 rows, each sub-experiment 5 rows within it
    rep -= 1
    subexp -= 1
    return df.iloc[rep*15 + subexp*5 : rep*15 + subexp*5 + 5]
demo:
In [174]: my_slice(1,1)
Out[174]:
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
In [175]: my_slice(2,1)
Out[175]:
Time Measurement
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
In [176]: my_slice(2,2)
Out[176]:
Time Measurement
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
P.S. A slightly more convenient way to concatenate your DataFrames:
df = pandas.concat([d1,d2,d3,d4,d5,d6], ignore_index=True)
so you don't need the subsequent .reset_index() and drop() calls.
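As an alternative sketch, the same block arithmetic can feed groupby to collect the pieces into a dict keyed by (repeat, sub-experiment); the names below are illustrative:

```python
import numpy as np
import pandas as pd

time = [0, 1, 2, 3, 4]
measurements = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]] * 2
df = pd.concat(
    [pd.DataFrame({'Time': time, 'Measurement': m}) for m in measurements],
    ignore_index=True)

# Label rows by repeat (blocks of 15) and sub-experiment (blocks of 5
# within a repeat), then collect each piece into a dict of sub-frames.
n = np.arange(len(df))
pieces = {key: grp for key, grp in df.groupby([n // 15 + 1, n % 15 // 5 + 1])}
```

pieces[(2, 3)], for example, then holds the third sub-experiment of the second repeat.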
