Normalize IDs column - python

I'm making a recommender system, and I'd like to have a matrix of ratings (User/Item). My problem is that there are only 9066 unique items in the dataset, but their IDs range from 1 to 165201. So I need a way to map the IDs into the range 1 to 9066 instead of 1 to 165201. How do I do that?

Consider the dataframe df:
import numpy as np
import pandas as pd

np.random.seed([3,1415])
df = pd.DataFrame(dict(
    User=np.random.randint(10, size=20),
    Item=np.random.randint(100, size=20)
))
print(df)
Item User
0 27 0
1 77 2
2 54 7
3 39 3
4 23 8
5 84 7
6 37 0
7 99 6
8 87 8
9 37 6
10 63 0
11 25 2
12 11 0
13 71 4
14 44 9
15 70 7
16 4 3
17 71 2
18 63 4
19 86 3
Use unique to get the unique values and build a mapping dictionary:
u = df.Item.unique()
m = dict(zip(u, range(len(u))))
Then use map to produce the remapped column:
df.assign(Item=df.Item.map(m))
Item User
0 0 0
1 1 2
2 2 7
3 3 3
4 4 8
5 5 7
6 6 0
7 7 6
8 8 8
9 6 6
10 9 0
11 10 2
12 11 0
13 12 4
14 13 9
15 14 7
16 15 3
17 12 2
18 9 4
19 16 3
Or we could have accomplished the same thing with pd.factorize:
df.assign(Item=pd.factorize(df.Item)[0])
Item User
0 0 0
1 1 2
2 2 7
3 3 3
4 4 8
5 5 7
6 6 0
7 7 6
8 8 8
9 6 6
10 9 0
11 10 2
12 11 0
13 12 4
14 13 9
15 14 7
16 15 3
17 12 2
18 9 4
19 16 3
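The question asks for IDs that start at 1 rather than 0, so as a small follow-up sketch (my own illustration, not part of the answer above), the factorized codes can simply be shifted by one:
item_codes, item_uniques = pd.factorize(df.Item)
df = df.assign(Item=item_codes + 1)   # codes now run from 1 to df.Item.nunique()
# item_uniques[new_id - 1] recovers the original ID for any new_id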

I would go through and find the item with the smallest id in the list, set it to 1, then find the next smallest, set it to 2, and so on.
Edit: you are right, that would take way too long. I would just go through and set one of them to 1, the next one to 2, and so on; it doesn't matter what order the IDs are in (I am guessing). When a new item is added, just set it to 9067, and so on.
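A rough sketch of that incremental idea (my own illustration of the description above, not the asker's code): hand out 1, 2, 3, ... in order of first appearance, and give any brand-new item the next free number:
from collections import defaultdict
from itertools import count

next_ids = count(1)                            # 1, 2, 3, ... in the order items are first seen
id_map = defaultdict(lambda: next(next_ids))   # remembers the number assigned to each raw ID
df['Item'] = df['Item'].map(lambda raw_id: id_map[raw_id])
An item that shows up later for the first time simply receives the next number (e.g. 9067 after 9066 existing items), matching the description above.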

Related

Adding element of a range of values to every N rows in a pandas DataFrame

I have the following dataframe that is ordered and consecutive:
Hour value
0 1 41
1 2 5
2 3 7
3 4 107
4 5 56
5 6 64
6 7 46
7 8 50
8 9 95
9 10 81
10 11 8
11 12 94
I want to add a range of values to each N rows (4 in this case), e.g.:
Hour value val
0 1 41 1
1 2 5 1
2 3 7 1
3 4 107 1
4 5 56 2
5 6 64 2
6 7 46 2
7 8 50 2
8 9 95 3
9 10 81 3
10 11 8 3
11 12 94 3
Using numpy.arange:
import numpy as np
df['val'] = np.arange(len(df))//4+1
Output:
Hour value val
0 1 41 1
1 2 5 1
2 3 7 1
3 4 107 1
4 5 56 2
5 6 64 2
6 7 46 2
7 8 50 2
8 9 95 3
9 10 81 3
10 11 8 3
11 12 94 3
IIUC, you can create the val column from the index as follows:
df['val'] = 1 + df.index//4
print(df)
Output
Hour value val
0 1 41 1
1 2 5 1
2 3 7 1
3 4 107 1
4 5 56 2
5 6 64 2
6 7 46 2
7 8 50 2
8 9 95 3
9 10 81 3
10 11 8 3
11 12 94 3
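As a small follow-up sketch (not part of either answer above): the index-based version assumes a default RangeIndex starting at 0, while position-based numbering keeps working with any index and lets the group size be a parameter:
import numpy as np

N = 4  # group size; assumed to be 4 as in the question, but any positive integer works
df['val'] = np.arange(len(df)) // N + 1   # position-based, independent of the index labels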

Create a counter that iterates over a column in a dataframe, and counts when a condition in the column is met

I currently have a column in my dataframe called step that I want to set a counter on. It contains a bunch of repeating numbers. I want to create a new column alongside it that holds a counter which increments when a certain condition is met: each time the value in step changes for a fourth time, the counter increments by 1, and the process then repeats. Here is an example of my code and what I'd like to achieve:
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,
                            7,8,8,8,9,9,7,7,8,8,8,9,9,9,7]})
df['counter'] = df['step'].cumsum()  # this should increment when it sees a fourth different number, then repeat
So ideally, my output would look like this:
print(df['step'])
[1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,
7,8,8,8,9,9,7,7,8,8,8,9,9,9,7,7]
print(df['counter'])
[0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,
3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5]
The numbers in step will vary, but the counter should always increment when the fourth different value in the sequence is identified, with the count of changes then starting over. I know I could probably do this with if statements, but my dataframe is large and I would rather do it in a faster, vectorised way if possible. Any help would be greatly appreciated!
You can convert your step column into categories and then count on the category codes:
import pandas as pd
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,10]})
df["counter"] = df.step.astype("category").values.codes // 3
Result:
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 10 3
Update for changed data (see comment):
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,7,8,8,8,9,9,7,7,8,8,8,9,9,9,7,7]})
df['counter'] = (df.step.diff().fillna(0).ne(0).cumsum() // 3).astype(int)
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 7 3
34 7 3
35 7 3
36 8 3
37 8 3
38 8 3
39 9 3
40 9 3
41 7 4
42 7 4
43 8 4
44 8 4
45 8 4
46 9 4
47 9 4
48 9 4
49 7 5
50 7 5
Compare the current and previous row of the step column to identify the boundaries (locations of transitions), then use cumsum to number the groups of rows, and floor-divide by 3 to create the counter:
m = df.step != df.step.shift()
df['counter'] = (m.cumsum() - 1) // 3
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 7 3
34 7 3
35 7 3
36 8 3
37 8 3
38 8 3
39 9 3
40 9 3
41 7 4
42 7 4
43 8 4
44 8 4
45 8 4
46 9 4
47 9 4
48 9 4
49 7 5
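The same idea can also be written as a single expression; a minimal sketch, equivalent to the two lines above:
df['counter'] = (df.step.ne(df.step.shift()).cumsum() - 1) // 3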

How to update each series field in a dataframe

I have a DataFrame which holds two columns like below:
player_id days
0 None 1
1 None 1
2 None 1
3 None 1
4 None 1
5 None 1
6 None 2
7 None 2
8 None 2
9 None 2
10 None 2
.
.
82 None 13
83 None 14
83 None 14
83 None 14
83 None 14
83 None 14
83 None 14
In the output, I need to replace None with the IDs of the players, which run from 1 to 11, to have something like:
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 2
7 8 2
8 9 2
9 10 2
10 11 2
11 1 2
12 2 2
13 3 2
14 4 2
.
.
82 5 13
83 6 14
83 7 14
83 8 14
83 9 14
83 10 14
83 11 14
This is my code:
for index in range(len(df)):
    for i in range(1, 11):
        df.iloc[index, 0] = i
print(df)
however I get the following dataframe:
player_id days
0 11 1
1 11 1
2 11 1
3 11 1
4 11 1
5 11 1
6 11 2
7 11 2
8 11 2
9 11 2
10 11 2
11 11 2
12 11 2
13 11 2
14 11 2
.
.
82 11 13
83 11 14
83 11 14
83 11 14
83 11 14
83 11 14
83 11 14
I also tried to add a new series as follows, but it does not work:
for index in range(len(df)):
    for i in range(1, 11):
        df.iloc[index, 0] = pd.Series([i, df['day']], index=['player_id', 'day'])
print(df)
I have some doubt whether editing a field in a dataframe is possible at all; I skipped itertuples and iterrows so that I could edit these rows in an efficient way.
Try the % (modulo) operator (the nested loop in the question writes every value of i into the same cell, so only the last one survives; a single vectorised assignment avoids that):
import numpy as np
df['player_id'] = 1 + np.arange(len(df))%11
df
output
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 2
7 8 2
8 9 2
9 10 2
10 11 2
82 1 13
83 2 14
83 3 14
83 4 14
83 5 14
83 6 14
83 7 14
Edit: using index
If the df's index (the first column in the output above) is not sequential and you want the same pattern based on the index instead, you can do
df['player_id'] = 1 + df.index%11
This can be done as follows:
i = 0
for index in range(len(df)):
    df.iloc[index, 0] = 1 + i % 11
    i += 1
print(df)
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 1
7 8 1
8 9 1
9 10 1
10 11 1
11 1 2
12 2 2
13 3 2
14 4 2
15 5 2
16 6 2
17 7 2
18 8 2
19 9 2
20 10 2
21 11 2
22 1 3
23 2 3
24 3 3
25 4 3
26 5 3
27 6 3
28 7 3
29 8 3
30 9 3
31 10 3
32 11 3
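For a large frame, the vectorised modulo is usually far faster than assigning cell by cell. A rough comparison sketch (my own, not from either answer; the 10,000-row frame is an arbitrary assumption):
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({'player_id': [None] * 10_000, 'days': 1})

t0 = time.perf_counter()
df['player_id'] = 1 + np.arange(len(df)) % 11   # one vectorised pass
print('vectorised:', time.perf_counter() - t0)

t0 = time.perf_counter()
for index in range(len(df)):                     # element-wise .iloc assignment
    df.iloc[index, 0] = 1 + index % 11
print('loop:', time.perf_counter() - t0)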

Pandas create new column ID based on values from other columns need to be matched

I'm new to programming and Pandas. Therefore, please do not judge strictly and sorry for my explanations.
I basically have two columns (DM1_ID, DM2_ID) and I need to create a new column ('NewID') based on the values of those two columns. In effect I am creating one new ID covering both columns: first take the value in the first column and put it into the 'NewID' column.
While doing that, I also need to consider DM2_ID: when that ID later appears in DM1_ID, it must receive the same NewID that its original DM1_ID row was given.
As an example, index 0 has DM1_ID 1 and DM2_ID 6, so I need to put 1 as the NewID for both IDs. When DM1_ID becomes 6 (index 15), no matter what is in DM2_ID, I need to give 1 as the NewID, since I already gave 1 to both DM1_ID 1 and DM2_ID 6. I also need to remember that row's DM2_ID for later use, and it will be 1 as well.
(At index 15, DM1_ID is 6 and DM2_ID is 45; since I already gave NewID 1 to both 1 and 6, DM1_ID 6 gets 1. And for 45 I also need to give 1 as the NewID when it appears later as DM1_ID (index 21).)
#I have a large table like this
DM1_ID DM2_ID
0 1 6
1 1 7
2 1 15
3 2 5
4 2 10
5 3 21
6 3 28
7 3 32
8 3 35
9 4 39
10 5 2
11 5 10
12 6 1
13 6 7
14 6 15
15 6 45
16 6 55
17 7 1
18 7 6
19 7 15
20 10 75
21 45 120
22 45 10
23 10 27
24 10 28
25 2 335
#I need to create this table
DM1_ID DM2_ID abc
0 1 6 1
1 1 7 1
2 1 15 1
3 2 5 2
4 2 10 2
5 3 21 3
6 3 28 3
7 3 32 3
8 3 35 3
9 4 39 4
10 5 2 2
11 5 10 2
12 6 1 1
13 6 7 1
14 6 15 1
15 6 45 1
16 6 55 1
17 7 1 1
18 7 6 1
19 7 15 1
20 10 75 2
21 45 120 1
22 45 10 2
23 10 27 2
24 10 28 2
25 2 335 2
Any help would be appreciated. Thanks.
One way to achieve your goal is to persist your IDs first. You can then use this persisted map table/dictionary to assign unique IDs once the conditions are met. I have included an example using a dictionary below, but you could alternatively use a database or a JSON file to persist the assigned IDs:
# build a [DM1_ID, DM2_ID] pair per row
df['pairs'] = df.apply(lambda x: [x[0], x[1]], axis=1)
pairs = df['pairs'].tolist()

u = {}    # working copy of the group map
u_ = {}   # persisted group map: {group_id: [member IDs]}
for p in pairs:
    if u:
        if not u_:
            u_ = u.copy()
        else:
            u = u_.copy()
        # merge the pair into every existing group it overlaps with
        for k in list(u.keys()):
            if any(x in u[k] for x in p):
                u_.update(
                    {
                        k: list(set(u[k] + p))
                    }
                )
            else:
                pass
        # if the pair touched no existing group, open a new one
        vals = [j for i in list(u.values()) for j in i]
        if u == u_ and not any(x in vals for x in p):
            n = max(list(u_.keys())) + 1
            u_[n] = p
        else:
            pass
    else:
        u[1] = p   # the very first pair seeds group 1
u_
Output:
{1: [1, 6, 7, 45, 15, 55, 120],
2: [75, 2, 10, 5],
3: [32, 35, 3, 21, 28],
4: [4, 39]}
Now let's apply a function that assigns the new ID per row, based on the dictionary we created in the previous step:
f = lambda x: next(k for k,v in u_.items() if any(i in v for i in x))
df['new_ID'] = df['pairs'].apply(f)
df.drop('pairs', axis=1, inplace=True)
df
Output:
DM1_ID DM2_ID new_ID
0 1 6 1
1 1 7 1
2 1 15 1
3 2 5 2
4 2 10 2
5 3 21 3
6 3 28 3
7 3 32 3
8 3 35 3
9 4 39 4
10 5 2 2
11 5 10 2
12 6 1 1
13 6 7 1
14 6 15 1
15 6 45 1
16 6 55 1
17 7 1 1
18 7 6 1
19 7 15 1
20 10 75 2
21 45 120 1
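A small usage sketch of the persistence idea (my own illustration): once u_ has been saved, a new incoming pair can be checked against it before deciding whether a brand-new group number is needed. The pair below is a hypothetical value, not from the question's data:
new_pair = [7, 200]   # hypothetical incoming row
new_id = next((k for k, v in u_.items() if any(i in v for i in new_pair)), None)
# -> 1, because 7 already belongs to group 1; None would mean a new group number is required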

Remove duplicates but keep some

Is it possible to remove duplicates but keep the last 3-4? Something like:
df = df.drop_duplicates(['ID'], keep='last_four')
Thank you
You can use groupby and tail and pass the number of rows you wish to keep to achieve the same result:
In [5]:
# data setup
df = pd.DataFrame({'ID':[0,0,0,0,0,0,1,1,1,1,1,1,1,2,2,3,3,3,3,3,3,3,3,3,4], 'val':np.arange(25)})
df
Out[5]:
ID val
0 0 0
1 0 1
2 0 2
3 0 3
4 0 4
5 0 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
11 1 11
12 1 12
13 2 13
14 2 14
15 3 15
16 3 16
17 3 17
18 3 18
19 3 19
20 3 20
21 3 21
22 3 22
23 3 23
24 4 24
Now groupby and call tail:
In [11]:
df.groupby('ID',as_index=False).tail(4)
Out[11]:
ID val
2 0 2
3 0 3
4 0 4
5 0 5
9 1 9
10 1 10
11 1 11
12 1 12
13 2 13
14 2 14
20 3 20
21 3 21
22 3 22
23 3 23
24 4 24
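If "last" should be defined by an explicit ordering column rather than the current row position, a hedged variant is to sort first (this assumes the val column carries the ordering, which the question does not specify); similarly, head(4) would keep the first four rows per ID:
df.sort_values('val').groupby('ID', as_index=False).tail(4)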
