How to update each Series field in a DataFrame - Python

I have a DataFrame which holds two columns like below:
player_id days
0 None 1
1 None 1
2 None 1
3 None 1
4 None 1
5 None 1
6 None 2
7 None 2
8 None 2
9 None 2
10 None 2
.
.
82 None 13
83 None 14
83 None 14
83 None 14
83 None 14
83 None 14
83 None 14
In the output, I need to replace None with the player IDs, which run from 1 to 11, so that I have something like:
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 2
7 8 2
8 9 2
9 10 2
10 11 2
11 1 2
12 2 2
13 3 2
14 4 2
.
.
82 5 13
83 6 14
83 7 14
83 8 14
83 9 14
83 10 14
83 11 14
This is my code:
for index in range(len(df)):
    for i in range(1, 11):
        df.iloc[index, 0] = i
print(df)
However, I get the following DataFrame:
player_id days
0 11 1
1 11 1
2 11 1
3 11 1
4 11 1
5 11 1
6 11 2
7 11 2
8 11 2
9 11 2
10 11 2
11 11 2
12 11 2
13 11 2
14 11 2
.
.
82 11 13
83 11 14
83 11 14
83 11 14
83 11 14
83 11 14
83 11 14
I also tried to assign a new Series as follows, but it does not work:
for index in range(len(df)):
    for i in range(1, 11):
        df.iloc[index, 0] = pd.Series([i, df['day']], index=['player_id', 'day'])
print(df)
I have some doubt whether editing a field in a DataFrame is possible at all. I deliberately skipped itertuples and iterrows so that I could edit these rows in an efficient way.

Try the % (modulo) operator:
import numpy as np
df['player_id'] = 1 + np.arange(len(df))%11
df
Output:
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 2
7 8 2
8 9 2
9 10 2
10 11 2
82 1 13
83 2 14
83 3 14
83 4 14
83 5 14
83 6 14
83 7 14
Edit: using the index
If the df's index (the first column in the output above) is not sequential and you want the same pattern but based on the index, then you can do
df['player_id'] = 1 + df.index % 11
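As a self-contained sketch (the frame below is a small illustrative stand-in for the question's data, not its real rows):

import numpy as np
import pandas as pd

# small stand-in for the question's frame: empty player_id plus a days column
df = pd.DataFrame({'player_id': [None] * 22,
                   'days': [1] * 11 + [2] * 11})

# positional cycle 1..11 down the rows
df['player_id'] = 1 + np.arange(len(df)) % 11

# index-based variant (same result here because the index is 0..n-1)
# df['player_id'] = 1 + df.index % 11
print(df)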

This can be done as follows:
i = 0
for index in range(len(df)):
    df.iloc[index, 0] = 1 + i % 11
    i += 1
print(df)
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 1
7 8 1
8 9 1
9 10 1
10 11 1
11 1 2
12 2 2
13 3 2
14 4 2
15 5 2
16 6 2
17 7 2
18 8 2
19 9 2
20 10 2
21 11 2
22 1 3
23 2 3
24 3 3
25 4 3
26 5 3
27 6 3
28 7 3
29 8 3
30 9 3
31 10 3
32 11 3
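Note that the separate counter i always equals index in this loop, so the same idea can be written without it (a sketch; the vectorized answer above will still be much faster on a large frame):

for index in range(len(df)):
    df.iloc[index, 0] = 1 + index % 11
print(df)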

Related

Adding element of a range of values to every N rows in a pandas DataFrame

I have the following dataframe that is ordered and consecutive:
Hour value
0 1 41
1 2 5
2 3 7
3 4 107
4 5 56
5 6 64
6 7 46
7 8 50
8 9 95
9 10 81
10 11 8
11 12 94
I want to add a range of values to every N rows (4 in this case), e.g.:
Hour value val
0 1 41 1
1 2 5 1
2 3 7 1
3 4 107 1
4 5 56 2
5 6 64 2
6 7 46 2
7 8 50 2
8 9 95 3
9 10 81 3
10 11 8 3
11 12 94 3
Using numpy.arange:
import numpy as np
df['val'] = np.arange(len(df))//4+1
Output:
Hour value val
0 1 41 1
1 2 5 1
2 3 7 1
3 4 107 1
4 5 56 2
5 6 64 2
6 7 46 2
7 8 50 2
8 9 95 3
9 10 81 3
10 11 8 3
11 12 94 3
IIUC, you can create the val column from the index as follows:
df['val'] = 1 + df.index//4
print(df)
Output
Hour value val
0 1 41 1
1 2 5 1
2 3 7 1
3 4 107 1
4 5 56 2
5 6 64 2
6 7 46 2
7 8 50 2
8 9 95 3
9 10 81 3
10 11 8 3
11 12 94 3
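If the block size should not be hard-coded, the same idea fits in a small helper (the function name and defaults below are illustrative, not from the question):

import numpy as np
import pandas as pd

def label_blocks(df, n=4, col='val'):
    # number consecutive blocks of n rows 1, 2, 3, ...
    df[col] = np.arange(len(df)) // n + 1
    return df

# label_blocks(df, n=4) reproduces the 'val' column shown above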

Create a counter that iterates over a column in a dataframe, and counts when a condition in the column is met

I currently have a column in my dataframe called step that I want to set a counter on. It contains a bunch of repeating numbers. I want to create a new column against this that holds a counter, which increments when a certain condition is met. The condition is: when the number changes for a fourth time in the step column, the counter increments by 1, and then the process repeats. Here is an example of my code, and what I'd like to achieve:
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,
                            7,8,8,8,9,9,7,7,8,8,8,9,9,9,7]})
df['counter'] = df['step'].cumsum()  # This should increment when it sees a fourth different number, and repeat
So ideally, my output would look like this:
print(df['step'])
[1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,
7,8,8,8,9,9,7,7,8,8,8,9,9,9,7,7]
print(df['counter'])
[0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,
3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5]
The numbers in step will vary, but the counter should always increment when the fourth different value in the sequence is identified, and then the count starts over. I know I could probably do this with if statements, but my dataframe is large and I would rather do it in a faster, vectorized way if possible. Any help would be greatly appreciated!
You can convert your step column into categories and then count on the category codes:
import pandas as pd
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,10]})
df["counter"] = df.step.astype("category").values.codes // 3
Result:
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 10 3
Update for changed data (see comment):
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,7,8,8,8,9,9,7,7,8,8,8,9,9,9,7,7]})
df['counter'] = (df.step.diff().fillna(0).ne(0).cumsum() // 3).astype(int)
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 7 3
34 7 3
35 7 3
36 8 3
37 8 3
38 8 3
39 9 3
40 9 3
41 7 4
42 7 4
43 8 4
44 8 4
45 8 4
46 9 4
47 9 4
48 9 4
49 7 5
50 7 5
Compare the current and previous row in the step column to identify the boundaries (locations of transitions), then use cumsum to assign a number to each group of rows and floor-divide by 3 to create the counter:
m = df.step != df.step.shift()
df['counter'] = (m.cumsum() - 1) // 3
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 7 3
34 7 3
35 7 3
36 8 3
37 8 3
38 8 3
39 9 3
40 9 3
41 7 4
42 7 4
43 8 4
44 8 4
45 8 4
46 9 4
47 9 4
48 9 4
49 7 5
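Both answers hard-code the block size of 3 changes; as a hedged sketch, the same logic can be wrapped with the size as a parameter (the function name is illustrative):

import pandas as pd

def change_counter(step: pd.Series, every: int = 3) -> pd.Series:
    # True where the value differs from the previous row
    changed = step.ne(step.shift())
    # running count of changes, shifted so the first block is 0,
    # then floor-divided so the counter bumps after every `every` changes
    return (changed.cumsum() - 1) // every

# df['counter'] = change_counter(df['step'], every=3)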

reshape Pandas dataframe by appending column to column

I have a Pandas df like this (df1):
0 1 2 3 4 5
0 a b c d e f
1 1 4 7 10 13 16
2 2 5 8 11 14 17
3 3 6 9 12 15 18
and I want to generate a DataFrame like this (df2):
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
Additional information about the given df:
the shape of the given df is unknown: b = df1.shape -> b = [n, m]
it is a given fact that the width of df1 is divisible by 3
I did try stack, melt and wide_to_long. With stack the order of the rows is lost; the rows should behave as shown in the exemplary df2. I would really appreciate any help.
Kind regards, Hans
Use np.vstack and np.hsplit:
>>> pd.DataFrame(np.vstack(np.hsplit(df, df.shape[1] / 3)))
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
Another example:
>>> df
0 1 2 3 4 5 6 7 8
0 a b c d e f g h i
1 1 4 7 10 13 16 19 22 25
2 2 5 8 11 14 17 20 23 26
3 3 6 9 12 15 18 21 24 27
>>> pd.DataFrame(np.vstack(np.hsplit(df, df.shape[1] / 3)))
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
8 g h i
9 19 22 25
10 20 23 26
11 21 24 27
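For completeness, the same reshaping as a self-contained sketch; df1 is rebuilt from the question's data, and the frame is converted to a NumPy array first so hsplit operates on a plain array:

import numpy as np
import pandas as pd

df1 = pd.DataFrame([list('abcdef'),
                    [1, 4, 7, 10, 13, 16],
                    [2, 5, 8, 11, 14, 17],
                    [3, 6, 9, 12, 15, 18]])

# split the 4 x 6 array into width-3 blocks and stack them vertically
blocks = np.hsplit(df1.to_numpy(), df1.shape[1] // 3)
df2 = pd.DataFrame(np.vstack(blocks))
print(df2)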
You can use DataFrame.append:
a = df[df.columns[: len(df.columns) // 3 + 1]]
b = df[df.columns[len(df.columns) // 3 + 1 :]]
b.columns = a.columns
df_out = a.append(b).reset_index(drop=True)
print(df_out)
Prints:
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
EDIT: To handle unknown widths:
dfs = []
for i in range(0, len(df.columns), 3):
    dfs.append(df[df.columns[i : i + 3]])
    dfs[-1].columns = df.columns[:3]
df_out = pd.concat(dfs)
print(df_out)
Prints:
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
0 d e f
1 10 13 16
2 11 14 17
3 12 15 18
0 g h i
1 19 22 25
2 20 23 26
3 21 24 27
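The index in this output restarts at 0 for every block; if a single running index like in the first answer is wanted, concat can renumber it:

# sequential 0..n-1 index instead of 0..3 repeating per block
df_out = pd.concat(dfs, ignore_index=True)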

Normalize IDs column

I'm making a recommender system, and I'd like to have a matrix of ratings (User/Item). My problem is there are only 9066 unique items in the dataset, but their IDs range from 1 to 165201. So I need a way to map the IDs to be in the range of 1 to 9066, instead of 1 to 165201. How do I do that?
Consider the dataframe df:
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    User=np.random.randint(10, size=20),
    Item=np.random.randint(100, size=20)
))
print(df)
Item User
0 27 0
1 77 2
2 54 7
3 39 3
4 23 8
5 84 7
6 37 0
7 99 6
8 87 8
9 37 6
10 63 0
11 25 2
12 11 0
13 71 4
14 44 9
15 70 7
16 4 3
17 71 2
18 63 4
19 86 3
Use unique to get the unique values and build a mapping dictionary:
u = df.Item.unique()
m = dict(zip(u, range(len(u))))
Then use map to produce the reconfigured column:
df.assign(Item=df.Item.map(m))
Item User
0 0 0
1 1 2
2 2 7
3 3 3
4 4 8
5 5 7
6 6 0
7 7 6
8 8 8
9 6 6
10 9 0
11 10 2
12 11 0
13 12 4
14 13 9
15 14 7
16 15 3
17 12 2
18 9 4
19 16 3
Or we could have accomplished the same thing with pd.factorize:
df.assign(Item=pd.factorize(df.Item)[0])
Item User
0 0 0
1 1 2
2 2 7
3 3 3
4 4 8
5 5 7
6 6 0
7 7 6
8 8 8
9 6 6
10 9 0
11 10 2
12 11 0
13 12 4
14 13 9
15 14 7
16 15 3
17 12 2
18 9 4
19 16 3
I would go through and find the item with the smallest ID in the list, set it to 1, then find the next smallest, set it to 2, and so on.
Edit: you are right, that would take way too long. I would just go through and set one of them to 1, the next one to 2, and so on. It doesn't matter what order the IDs are in (I am guessing). When a new item is added, just set it to 9067, and so on.
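Applied to the original question, factorize gives codes 0 to 9065 in order of first appearance, and adding 1 shifts them into the requested 1 to 9066 range. A minimal sketch (the ratings frame and its column names are assumed for illustration):

import pandas as pd

# hypothetical ratings frame with sparse, large item IDs
ratings = pd.DataFrame({'user_id': [1, 2, 1, 3],
                        'item_id': [165201, 42, 165201, 90107]})

# factorize assigns 0..k-1 per unique ID; +1 maps into 1..k
ratings['item_id'] = pd.factorize(ratings['item_id'])[0] + 1
print(ratings)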

Remove duplicates but keep some

Is it possible to remove duplicates but keep the last 3-4? Something like:
df = df.drop_duplicates(['ID'], keep='last_four')
Thank you
You can use groupby and tail, passing the number of rows you wish to keep, to achieve the same result:
In [5]:
# data setup
df = pd.DataFrame({'ID':[0,0,0,0,0,0,1,1,1,1,1,1,1,2,2,3,3,3,3,3,3,3,3,3,4], 'val':np.arange(25)})
df
Out[5]:
ID val
0 0 0
1 0 1
2 0 2
3 0 3
4 0 4
5 0 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
11 1 11
12 1 12
13 2 13
14 2 14
15 3 15
16 3 16
17 3 17
18 3 18
19 3 19
20 3 20
21 3 21
22 3 22
23 3 23
24 4 24
Now groupby and call tail:
In [11]:
df.groupby('ID',as_index=False).tail(4)
Out[11]:
ID val
2 0 2
3 0 3
4 0 4
5 0 5
9 1 9
10 1 10
11 1 11
12 1 12
13 2 13
14 2 14
20 3 20
21 3 21
22 3 22
23 3 23
24 4 24
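The same pattern covers other variants of "keep some": head keeps the first rows per group, and a cumcount mask makes the tail behaviour explicit (a sketch, not part of the original answer):

# keep the first 4 rows of each ID instead
df.groupby('ID', as_index=False).head(4)

# equivalent to tail(4): number rows from the end of each group and filter
mask = df.groupby('ID').cumcount(ascending=False) < 4
df[mask]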
