I have a column in my dataframe called step that contains runs of repeating numbers. I want to create a new column holding a counter that increments when a certain condition is met: each time the value in step changes to a fourth different number, the counter increments by 1, and then the process repeats. Here is an example of my code, and what I'd like to achieve:
import pandas as pd
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,
7,8,8,8,9,9,7,7,8,8,8,9,9,9,7,7]})
df['counter'] = df['step'].cumsum() # placeholder: this should increment when it sees a fourth different number, then repeat
So ideally, my output would look like this:
print(df['step'])
[1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,
7,8,8,8,9,9,7,7,8,8,8,9,9,9,7,7]
print(df['counter'])
[0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,
3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5]
The numbers in step will vary, but the counter should always increment when the fourth different value in the sequence is identified, and the count of different values then starts over. I know I could probably do this with if statements, but my dataframe is large and I would rather use a faster, vectorized comparison if possible. Any help would be greatly appreciated!
You can convert your step column into categories and then count on the category codes:
import pandas as pd
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,10]})
df["counter"] = df.step.astype("category").values.codes // 3
Result:
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 10 3
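A note on when this works: category codes number the distinct values in sorted order, so a given value always gets the same code wherever it appears. The small sketch below uses made-up data (not the data from the question) to show the codes on a column where each value forms one sorted run, and on a column where values come back later. In the second case the codes stop counting transitions.

```python
import pandas as pd

# Each value appears in a single sorted run: the codes count the runs.
sorted_steps = pd.Series([1, 1, 2, 3, 4, 4, 7])
sorted_codes = sorted_steps.astype("category").cat.codes.tolist()

# Value 7 comes back after 9: it gets the same code as the first time.
repeating_steps = pd.Series([7, 8, 9, 7, 8])
repeating_codes = repeating_steps.astype("category").cat.codes.tolist()

print(sorted_codes)     # [0, 0, 1, 2, 3, 3, 4]
print(repeating_codes)  # [0, 1, 2, 0, 1]
```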
Update for changed data (see comment):
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,7,8,8,8,9,9,7,7,8,8,8,9,9,9,7,7]})
df['counter'] = (df.step.diff().fillna(0).ne(0).cumsum() // 3).astype(int)
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 7 3
34 7 3
35 7 3
36 8 3
37 8 3
38 8 3
39 9 3
40 9 3
41 7 4
42 7 4
43 8 4
44 8 4
45 8 4
46 9 4
47 9 4
48 9 4
49 7 5
50 7 5
Compare the current and previous row in the step column to identify boundaries (the locations of transitions), then use cumsum to assign a number to each group of rows, and floor divide by 3 to create the counter:
m = df.step != df.step.shift()
df['counter'] = (m.cumsum() - 1) // 3
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 7 3
34 7 3
35 7 3
36 8 3
37 8 3
38 8 3
39 9 3
40 9 3
41 7 4
42 7 4
43 8 4
44 8 4
45 8 4
46 9 4
47 9 4
48 9 4
49 7 5
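If the group size may change, the same idea can be wrapped in a small helper. This is only a sketch: the names run_counter and runs_per_group are made up for illustration, and the sample data is invented.

```python
import pandas as pd

def run_counter(steps: pd.Series, runs_per_group: int = 3) -> pd.Series:
    """Number consecutive runs of equal values, then bucket the runs."""
    # True on the first row of every run (the first row compares against NaN).
    starts = steps != steps.shift()
    # 0-based run number for every row.
    run_id = starts.cumsum() - 1
    # Every `runs_per_group` runs share one counter value.
    return run_id // runs_per_group

steps = pd.Series([1, 1, 2, 3, 3, 4, 5, 5, 6, 4, 4, 5])
counter = run_counter(steps)
print(counter.tolist())  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
```

On this sample the result matches the diff based expression from the other answer, (steps.diff().fillna(0).ne(0).cumsum() // 3).astype(int), since both count the same transitions.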
I have a DataFrame which holds two columns like below:
player_id days
0 None 1
1 None 1
2 None 1
3 None 1
4 None 1
5 None 1
6 None 2
7 None 2
8 None 2
9 None 2
10 None 2
.
.
82 None 13
83 None 14
83 None 14
83 None 14
83 None 14
83 None 14
83 None 14
In the output, I need to replace None with the player ids, which cycle from 1 to 11, to get something like:
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 2
7 8 2
8 9 2
9 10 2
10 11 2
11 1 2
12 2 2
13 3 2
14 4 2
.
.
82 5 13
83 6 14
83 7 14
83 8 14
83 9 14
83 10 14
83 11 14
This is my code:
for index in range(len(df)):
    for i in range(1, 11):
        df.iloc[index, 0] = i
print(df)
However, I get the following dataframe:
player_id days
0 11 1
1 11 1
2 11 1
3 11 1
4 11 1
5 11 1
6 11 2
7 11 2
8 11 2
9 11 2
10 11 2
11 11 2
12 11 2
13 11 2
14 11 2
.
.
82 11 13
83 11 14
83 11 14
83 11 14
83 11 14
83 11 14
83 11 14
I also tried to add a new series as follows, but it does not work:
for index in range(len(df)):
    for i in range(1, 11):
        df.iloc[index, 0] = pd.Series([i, df['day']], index=['player_id', 'day'])
print(df)
I am not sure whether editing a field in a dataframe is possible at all. I skipped itertuples and iterrows in the hope of editing these rows in an efficient way.
Try the % operator:
import numpy as np
df['player_id'] = 1 + np.arange(len(df))%11
df
Output:
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 2
7 8 2
8 9 2
9 10 2
10 11 2
82 1 13
83 2 14
83 3 14
83 4 14
83 5 14
83 6 14
83 7 14
Edit: using the index
If the df's index (the first column in the output above) is not sequential and you want the same pattern but based on the index, you can do:
df['player_id'] = 1 + df.index%11
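The two variants only differ when the index has gaps. A minimal sketch, assuming a hypothetical frame with 3 players instead of 11 and an index with missing labels (for example after filtering rows):

```python
import numpy as np
import pandas as pd

n_players = 3  # hypothetical; the question uses 11

# A frame whose index has gaps.
df = pd.DataFrame({"days": [1, 1, 1, 2, 2]}, index=[0, 1, 4, 5, 9])

# Cycle by row position: always 1, 2, 3, 1, 2, ...
df["by_position"] = 1 + np.arange(len(df)) % n_players

# Cycle by index label: follows the labels, so the gaps show up.
df["by_index"] = 1 + df.index % n_players

print(df)
```

Here by_position is 1, 2, 3, 1, 2, while by_index is 1, 2, 2, 3, 1, because the labels 4, 5 and 9 give remainders 1, 2 and 0.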
This can be done with a single loop:
i = 0
for index in range(len(df)):
    df.iloc[index, 0] = 1 + i % 11
    i += 1
print(df)
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 1
7 8 1
8 9 1
9 10 1
10 11 1
11 1 2
12 2 2
13 3 2
14 4 2
15 5 2
16 6 2
17 7 2
18 8 2
19 9 2
20 10 2
21 11 2
22 1 3
23 2 3
24 3 3
25 4 3
26 5 3
27 6 3
28 7 3
29 8 3
30 9 3
31 10 3
32 11 3
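For reference, the nested loop in the question leaves the same value in every row because the inner loop writes to the same cell on every pass, so only its last value survives. A small sketch with hypothetical sizes (3 players, 5 rows) that puts the two loops side by side:

```python
import pandas as pd

df = pd.DataFrame({"player_id": [None] * 5, "days": [1, 1, 1, 2, 2]})

# Nested version: every i overwrites the same cell, so the last one wins.
nested = df.copy()
for index in range(len(nested)):
    for i in range(1, 4):
        nested.iloc[index, 0] = i

# Single loop: one write per row, with the value derived from the row position.
single = df.copy()
for index in range(len(single)):
    single.iloc[index, 0] = 1 + index % 3

print(nested["player_id"].tolist())  # [3, 3, 3, 3, 3]
print(single["player_id"].tolist())  # [1, 2, 3, 1, 2]
```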
I'm new to programming and Pandas, so please don't judge too strictly, and sorry for my explanations.
I have two columns (DM1_ID, DM2_ID) and I need to create a new column ('NewID') based on the values in those two columns. In effect I'm creating a new ID that covers both columns. First, evaluate the value in the first column and put that value into the 'NewID' column.
While doing that, I also need to keep track of DM2_ID: when that id later appears in DM1_ID, the row must get the same NewID.
For example, index 0 has DM1_ID 1 and DM2_ID 6, so I need to use 1 as the NewID for both ids. When DM1_ID is 6 (index 15), no matter what is in DM2_ID, the NewID must be 1, because ids 1 and 6 already share NewID 1. I also need to remember that row's DM2_ID for later use, and it gets 1 as well.
(Index 15 has DM1_ID 6 and DM2_ID 45. Since I already gave NewID 1 to both 1 and 6, DM1_ID 6 gets 1. For 45, I also need to give 1 as the NewID (index 21).)
#I have a large table like this
DM1_ID DM2_ID
0 1 6
1 1 7
2 1 15
3 2 5
4 2 10
5 3 21
6 3 28
7 3 32
8 3 35
9 4 39
10 5 2
11 5 10
12 6 1
13 6 7
14 6 15
15 6 45
16 6 55
17 7 1
18 7 6
19 7 15
20 10 75
21 45 120
22 45 10
23 10 27
24 10 28
25 2 335
#I need to create this table
DM1_ID DM2_ID NewID
0 1 6 1
1 1 7 1
2 1 15 1
3 2 5 2
4 2 10 2
5 3 21 3
6 3 28 3
7 3 32 3
8 3 35 3
9 4 39 4
10 5 2 2
11 5 10 2
12 6 1 1
13 6 7 1
14 6 15 1
15 6 45 1
16 6 55 1
17 7 1 1
18 7 6 1
19 7 15 1
20 10 75 2
21 45 120 1
22 45 10 2
23 10 27 2
24 10 28 2
25 2 335 2
Any help would be appreciated. Thanks.
One way to achieve your goal is to persist your IDs first. You can then use this persisted mapping to assign a unique ID whenever the conditions are met. The example below keeps the mapping in a dictionary, but you could instead persist the assigned IDs in a database or a JSON file:
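# collect each row's two ids as a list
df['pairs'] = df.apply(lambda x: [x.iloc[0], x.iloc[1]], axis=1)
pairs = df['pairs'].tolist()
u = {}   # snapshot of the groups before the current pair
u_ = {}  # working copy: new ID -> list of member ids
for p in pairs:
    if u:
        if not u_:
            u_ = u.copy()
        else:
            u = u_.copy()
        # merge the pair into every group that already contains one of its ids
        for k in list(u.keys()):
            if any(x in u[k] for x in p):
                u_.update(
                    {
                        k: list(set(u[k] + p))
                    }
                )
            else:
                pass
        vals = [j for i in list(u.values()) for j in i]
        # nothing was merged and both ids are unseen: start a new group
        if u == u_ and not any(x in vals for x in p):
            n = max(list(u_.keys())) + 1
            u_[n] = p
        else:
            pass
    else:
        # the very first pair starts group 1
        u[1] = p
u_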
Output:
{1: [1, 6, 7, 45, 15, 55, 120],
2: [75, 2, 10, 5],
3: [32, 35, 3, 21, 28],
4: [4, 39]}
Now let's apply a function that assigns new ID per row based on the dictionary we have created in the previous step:
f = lambda x: next(k for k,v in u_.items() if any(i in v for i in x))
df['new_ID'] = df['pairs'].apply(f)
df.drop('pairs', axis=1, inplace=True)
df
Output:
DM1_ID DM2_ID new_ID
0 1 6 1
1 1 7 1
2 1 15 1
3 2 5 2
4 2 10 2
5 3 21 3
6 3 28 3
7 3 32 3
8 3 35 3
9 4 39 4
10 5 2 2
11 5 10 2
12 6 1 1
13 6 7 1
14 6 15 1
15 6 45 1
16 6 55 1
17 7 1 1
18 7 6 1
19 7 15 1
20 10 75 2
21 45 120 1
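As an alternative to the dictionary scan above, the same grouping can be sketched with a union-find (disjoint set) structure. This swaps in a different technique and rests on an assumption that the question does not state outright: ids that are linked directly or through a chain of rows share one label. Under that assumption, a row that links two existing groups merges them. The names linked_group_ids, find and sample are made up for illustration, and the sample rows are a small invented subset.

```python
import pandas as pd

def linked_group_ids(df, col_a="DM1_ID", col_b="DM2_ID"):
    """Label rows so that ids linked directly or transitively share a label."""
    parent = {}  # id -> representative id of its group

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Every row links its two ids into the same group.
    for a, b in zip(df[col_a], df[col_b]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the groups 1, 2, ... in order of first appearance.
    roots = [find(a) for a in df[col_a]]
    codes, _ = pd.factorize(pd.Series(roots))
    return pd.Series(codes + 1, index=df.index)

sample = pd.DataFrame({"DM1_ID": [1, 1, 2, 3, 6, 5],
                       "DM2_ID": [6, 7, 5, 21, 45, 2]})
sample["NewID"] = linked_group_ids(sample)
print(sample)
```

On this sample the labels come out as 1, 1, 2, 3, 1, 2: ids 6 and 45 join group 1 through id 1, and id 5 joins group 2 through id 2.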
Is it possible to remove duplicates but keep the last 3 or 4 of each? Something like:
df = df.drop_duplicates(['ID'], keep='last_four')
Thank you
You can use groupby and tail, passing the number of rows you wish to keep, to achieve this:
In [5]:
# data setup
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID':[0,0,0,0,0,0,1,1,1,1,1,1,1,2,2,3,3,3,3,3,3,3,3,3,4], 'val':np.arange(25)})
df
Out[5]:
ID val
0 0 0
1 0 1
2 0 2
3 0 3
4 0 4
5 0 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
11 1 11
12 1 12
13 2 13
14 2 14
15 3 15
16 3 16
17 3 17
18 3 18
19 3 19
20 3 20
21 3 21
22 3 22
23 3 23
24 4 24
Now groupby and call tail:
In [11]:
df.groupby('ID',as_index=False).tail(4)
Out[11]:
ID val
2 0 2
3 0 3
4 0 4
5 0 5
9 1 9
10 1 10
11 1 11
12 1 12
13 2 13
14 2 14
20 3 20
21 3 21
22 3 22
23 3 23
24 4 24
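If you need the selection as a boolean mask (for example to combine it with other conditions), a similar result can be sketched with cumcount. The data and the variable names keep, from_end and last_rows are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"ID": [0, 0, 0, 1, 2, 2, 2, 2], "val": range(8)})
keep = 2  # hypothetical number of trailing rows to keep per ID

# Rank rows within each ID counting from the end: the last row gets 0.
from_end = df.groupby("ID").cumcount(ascending=False)

# Keep only the rows whose rank from the end is below the limit.
last_rows = df[from_end < keep]

print(last_rows)
```

On this sample the kept rows are those at index 1, 2, 3, 6 and 7, the same rows that df.groupby("ID").tail(keep) returns.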