I have a column in my dataframe called step that contains runs of repeating numbers. I want to create a new column holding a counter that increments when a certain condition is met: each time the value in step changes to a fourth different number, the counter increments by 1, and then the process repeats. Here is an example of my code and what I'd like to achieve:
import pandas as pd
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,7,8,8,8,9,9,7,7,8,8,8,9,9,9,7]})
df['counter'] = df['step'].cumsum()  # placeholder: this should instead increment at each fourth different number, then repeat
So ideally, my output would look like this:
print(df['step'])
[1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,
7,8,8,8,9,9,7,7,8,8,8,9,9,9,7,7]
print(df['counter'])
[0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,
3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5]
The numbers in step will vary, but the counter should always increment when the fourth different value in the sequence is identified, after which the count of different values starts over. I know I could probably do this with if statements, but my dataframe is large and I would prefer a faster approach if possible. Any help would be greatly appreciated!
You can convert your step column into categories and then count on the category codes:
import pandas as pd
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,10]})
df["counter"] = df.step.astype("category").cat.codes // 3
Result:
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 10 3
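Category codes depend only on the value, not on where it sits in the column, so this approach assumes every value occurs in a single contiguous run. A minimal sketch with hypothetical data, in which 7, 8 and 9 come back later, shows the counter getting stuck instead of incrementing:

```python
import pandas as pd

# Hypothetical data: the values 7, 8, 9 repeat later in the column.
df = pd.DataFrame({"step": [1, 2, 3, 7, 8, 9, 7, 8, 9, 7]})

# Each distinct value gets one fixed code (1->0, 2->1, 3->2, 7->3, 8->4, 9->5),
# so a repeated 7 maps back to code 3 no matter how many changes came before it.
codes = df["step"].astype("category").cat.codes
df["counter"] = codes // 3

print(df["counter"].tolist())
# [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

For data like this, the counter has to be built from the changes between neighbouring rows rather than from the values themselves.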
Update for changed data (see comment):
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,7,8,8,8,9,9,7,7,8,8,8,9,9,9,7,7]})
df['counter'] = (df.step.diff().fillna(0).ne(0).cumsum() // 3).astype(int)
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 7 3
34 7 3
35 7 3
36 8 3
37 8 3
38 8 3
39 9 3
40 9 3
41 7 4
42 7 4
43 8 4
44 8 4
45 8 4
46 9 4
47 9 4
48 9 4
49 7 5
50 7 5
Compare the current and previous rows of the step column to identify boundaries (the locations of transitions), then use cumsum to number the groups of rows, and floor-divide by 3 to create the counter:
m = df.step != df.step.shift()
df['counter'] = (m.cumsum() - 1) // 3
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 7 3
34 7 3
35 7 3
36 8 3
37 8 3
38 8 3
39 9 3
40 9 3
41 7 4
42 7 4
43 8 4
44 8 4
45 8 4
46 9 4
47 9 4
48 9 4
49 7 5
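The shift/cumsum steps can be wrapped in a small function to make the intermediate values easier to follow. This is only a sketch: the helper name run_counter, its parameter n, and the sample data are hypothetical and not part of the answer above.

```python
import pandas as pd

def run_counter(steps: pd.Series, n: int = 3) -> pd.Series:
    """Hypothetical helper: number the runs of equal values, then bucket every n runs."""
    # True wherever a value differs from the previous row (the first row is always True).
    is_new_run = steps != steps.shift()
    # The running total of boundaries gives each run an id; subtract 1 to start at 0.
    run_id = is_new_run.cumsum() - 1
    # Every n consecutive runs share one counter value.
    return run_id // n

df = pd.DataFrame({"step": [1, 1, 2, 2, 3, 4, 4, 5, 6, 7, 7]})
df["counter"] = run_counter(df["step"])

print(df["counter"].tolist())
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
```

Because only neighbouring rows are compared, a value that reappears later (such as the second run of 7s in the question) still starts a new run.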
I have a DataFrame which holds two columns like below:
player_id days
0 None 1
1 None 1
2 None 1
3 None 1
4 None 1
5 None 1
6 None 2
7 None 2
8 None 2
9 None 2
10 None 2
.
.
82 None 13
83 None 14
83 None 14
83 None 14
83 None 14
83 None 14
83 None 14
In the output, I need to replace None with the player ids, which run from 1 to 11 and then repeat, to get something like:
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 2
7 8 2
8 9 2
9 10 2
10 11 2
11 1 2
12 2 2
13 3 2
14 4 2
.
.
82 5 13
83 6 14
83 7 14
83 8 14
83 9 14
83 10 14
83 11 14
This is my code:
for index in range(len(df)):
    for i in range(1, 11):
        df.iloc[index, 0] = i
print(df)
However, I get the following dataframe:
player_id days
0 11 1
1 11 1
2 11 1
3 11 1
4 11 1
5 11 1
6 11 2
7 11 2
8 11 2
9 11 2
10 11 2
11 11 2
12 11 2
13 11 2
14 11 2
.
.
82 11 13
83 11 14
83 11 14
83 11 14
83 11 14
83 11 14
83 11 14
I also tried to add a new series as follows, but it does not work:
for index in range(len(df)):
    for i in range(1, 11):
        df.iloc[index, 0] = pd.Series([i, df['day']], index=['player_id', 'day'])
print(df)
I have some doubts about whether editing a field in a dataframe is possible at all. I avoided itertuples and iterrows so that I could edit the rows efficiently.
Try the % (modulo) operator:
import numpy as np
df['player_id'] = 1 + np.arange(len(df))%11
df
output
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 2
7 8 2
8 9 2
9 10 2
10 11 2
82 1 13
83 2 14
83 3 14
83 4 14
83 5 14
83 6 14
83 7 14
Edit: using index
If the df's index (the first column in the output above) is not sequential and you want the same pattern based on the index instead, you can do:
df['player_id'] = 1 + df.index%11
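The nested loop in the question writes every value of i into the same cell, so each row ends up holding only the last value written. The modulo approach instead derives the id from the row position. A self-contained sketch on a hypothetical 25-row frame (the frame and the name n_players are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 25 rows, player_id not yet filled in.
df = pd.DataFrame({
    "player_id": [None] * 25,
    "days": np.repeat([1, 2, 3], [11, 11, 3]),
})

n_players = 11  # ids run from 1 to 11, as in the question

# Row positions 0, 1, 2, ... wrap around every 11 rows; adding 1 shifts 0..10 to 1..11.
df["player_id"] = 1 + np.arange(len(df)) % n_players

print(df["player_id"].tolist())
# [1, 2, ..., 11, 1, 2, ..., 11, 1, 2, 3]
```

Using np.arange(len(df)) ties the pattern to row position, so it does not depend on what the index labels happen to be.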
This can be done as follows:
i=0
for index in range(len(df)):
    df.iloc[index, 0] = 1+i%11
    i+=1
print(df)
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 1
7 8 1
8 9 1
9 10 1
10 11 1
11 1 2
12 2 2
13 3 2
14 4 2
15 5 2
16 6 2
17 7 2
18 8 2
19 9 2
20 10 2
21 11 2
22 1 3
23 2 3
24 3 3
25 4 3
26 5 3
27 6 3
28 7 3
29 8 3
30 9 3
31 10 3
32 11 3
I'm making a recommender system, and I'd like to have a matrix of ratings (User/Item). My problem is there are only 9066 unique items in the dataset, but their IDs range from 1 to 165201. So I need a way to map the IDs to be in the range of 1 to 9066, instead of 1 to 165201. How do I do that?
Consider the dataframe df:
import numpy as np
import pandas as pd
np.random.seed([3,1415])
df = pd.DataFrame(dict(
    User=np.random.randint(10, size=20),
    Item=np.random.randint(100, size=20)
))
print(df)
Item User
0 27 0
1 77 2
2 54 7
3 39 3
4 23 8
5 84 7
6 37 0
7 99 6
8 87 8
9 37 6
10 63 0
11 25 2
12 11 0
13 71 4
14 44 9
15 70 7
16 4 3
17 71 2
18 63 4
19 86 3
Use unique to get unique values and build a mapping dictionary
u = df.Item.unique()
m = dict(zip(u, range(len(u))))
Then use map to produce the remapped column (the new ids here start at 0):
df.assign(Item=df.Item.map(m))
Item User
0 0 0
1 1 2
2 2 7
3 3 3
4 4 8
5 5 7
6 6 0
7 7 6
8 8 8
9 6 6
10 9 0
11 10 2
12 11 0
13 12 4
14 13 9
15 14 7
16 15 3
17 12 2
18 9 4
19 16 3
Or we could have accomplished the same thing with pd.factorize
df.assign(Item=pd.factorize(df.Item)[0])
Item User
0 0 0
1 1 2
2 2 7
3 3 3
4 4 8
5 5 7
6 6 0
7 7 6
8 8 8
9 6 6
10 9 0
11 10 2
12 11 0
13 12 4
14 13 9
15 14 7
16 15 3
17 12 2
18 9 4
19 16 3
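Both variants above produce ids starting at 0, while the question asks for ids starting at 1. A minimal sketch of one way to shift them and still map back to the original ids, assuming hypothetical ratings data and a hypothetical column name ItemIdx:

```python
import pandas as pd

# Hypothetical ratings with sparse item ids.
df = pd.DataFrame({"User": [0, 1, 1, 2], "Item": [165201, 17, 165201, 4000]})

# codes are 0-based positions into uniques, in order of first appearance.
codes, uniques = pd.factorize(df["Item"])
df["ItemIdx"] = codes + 1  # shift from 0-based to 1-based ids

print(df["ItemIdx"].tolist())
# [1, 2, 1, 3]

# The original id for compact id k is uniques[k - 1].
original = uniques[df["ItemIdx"].to_numpy() - 1]
print(original.tolist())
# [165201, 17, 165201, 4000]
```

Keeping uniques around is what makes it possible to translate recommendations back to the original item ids later.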
I would go through and find the item with the smallest id in the list, set it to 1, then find the next smallest, set it to 2, and so on.
Edit: you are right, that would take far too long. I would just go through the items and set the first one to 1, the next one to 2, and so on. The order of the ids does not matter (I am guessing). When a new item is added, just set it to 9067, and so on.
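The idea of handing out the next free id as items are encountered could be sketched with a plain dictionary. The helper assign_ids, the column name ItemIdx, and the sample data are hypothetical:

```python
import pandas as pd

def assign_ids(items, mapping=None):
    """Hypothetical helper: give each unseen item the next free id, starting at 1."""
    mapping = {} if mapping is None else mapping
    for item in items:
        if item not in mapping:
            mapping[item] = len(mapping) + 1
    return mapping

df = pd.DataFrame({"Item": [165201, 17, 165201, 4000]})
mapping = assign_ids(df["Item"])
df["ItemIdx"] = df["Item"].map(mapping)

print(df["ItemIdx"].tolist())
# [1, 2, 1, 3]

# An item that arrives later simply receives the next id; existing ids are unchanged.
assign_ids([99999], mapping)
print(mapping[99999])
# 4
```

If the new ids should follow the sorted order of the original ids, as in the first suggestion, pd.factorize(df["Item"], sort=True) is one option, though ids assigned that way are not stable when new items arrive.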