I have the following dataframe (short sample):
cond_ID tow1_ID tow2_ID
0 10 0 4
1 17 6 12
3 13 14 15
4 13 16 16
5 13 17 18
I want to extend it based on the range between tow1_ID and tow2_ID. For instance, for the first row I want to add records with the values 1, 2, 3 and 4 below the value 0. Here is the desired output:
cond_ID tow1_ID
0 10 0
0 10 1
0 10 2
0 10 3
0 10 4
1 17 6
1 17 7
1 17 8
1 17 9
1 17 10
1 17 11
1 17 12
1 13 14
1 13 15
1 13 16
1 13 17
1 13 18
How can I do this with a vectorized approach (without using apply)? Any help is highly appreciated.
Try this:
import numpy as np

df.assign(tow1_ID=[np.arange(s, f + 1) for s, f in zip(df['tow1_ID'], df['tow2_ID'])])\
  .explode('tow1_ID')\
  .drop(['tow2_ID'], axis=1)
Output:
cond_ID tow1_ID
0 10 0
0 10 1
0 10 2
0 10 3
0 10 4
1 17 6
1 17 7
1 17 8
1 17 9
1 17 10
1 17 11
1 17 12
3 13 14
3 13 15
4 13 16
5 13 17
5 13 18
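If the list comprehension becomes a bottleneck, the same expansion can probably be done with NumPy alone, by repeating each row and adding a running offset. This is only a minimal sketch and not part of the answer above: the names `lengths`, `row_pos`, `starts` and `expanded` are illustrative, the sample frame is rebuilt from the question, and it assumes tow2_ID >= tow1_ID in every row.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (index labels 0, 1, 3, 4, 5).
df = pd.DataFrame(
    {"cond_ID": [10, 17, 13, 13, 13],
     "tow1_ID": [0, 6, 14, 16, 17],
     "tow2_ID": [4, 12, 15, 16, 18]},
    index=[0, 1, 3, 4, 5],
)

# Number of output records per source row (assumes tow2_ID >= tow1_ID).
lengths = (df["tow2_ID"] - df["tow1_ID"] + 1).to_numpy()

# Repeat each source row `lengths[i]` times, keeping its index label.
row_pos = np.repeat(np.arange(len(df)), lengths)
expanded = df.iloc[row_pos][["cond_ID", "tow1_ID"]].copy()

# Offset 0, 1, 2, ... inside each repeated block.
starts = np.repeat(lengths.cumsum() - lengths, lengths)
offsets = np.arange(lengths.sum()) - starts
expanded["tow1_ID"] = expanded["tow1_ID"].to_numpy() + offsets

print(expanded)
```

Unlike explode, this keeps tow1_ID as an integer column rather than object dtype.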
def foo(r):
    # Build one small frame per source row, with one record per value in the range.
    return pd.DataFrame({"cond_ID": r.cond_ID,
                         "tow_ID": range(r.tow1_ID, r.tow2_ID + 1),
                         "index": r.name}).set_index("index")

print(pd.concat([foo(r) for _, r in df.iterrows()]))
# cond_ID tow_ID
# index
# 0 10 0
# 0 10 1
# 0 10 2
# 0 10 3
# 0 10 4
# 1 17 6
# 1 17 7
# 1 17 8
# 1 17 9
# 1 17 10
# 1 17 11
# 1 17 12
# 3 13 14
# 3 13 15
# 4 13 16
# 5 13 17
# 5 13 18
I am currently working on a large DataFrame and need to use the data in a column as the window for a rolling calculation. Every row has its own rolling window value, so I need to reference the column, but I am getting this error:
ValueError: window must be an integer 0 or greater
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,20,size=(20, 4)), columns=list('abcd'))
df['op'] = (np.random.randint(0,20, size=20))
a b c d op
0 6 17 3 5 9
1 8 3 13 7 2
2 19 12 18 3 8
3 8 8 5 4 17
4 0 5 9 3 19
5 0 5 19 9 11
6 7 7 13 8 10
7 7 5 12 0 4
8 13 17 4 4 17
9 7 0 16 9 7
10 7 8 13 10 13
11 18 3 1 11 16
12 4 4 5 13 4
13 9 8 14 19 9
14 13 10 10 7 10
15 9 16 11 16 3
16 5 7 3 0 11
17 13 14 10 1 16
18 6 14 13 4 18
19 1 9 8 0 19
I am trying to use the value in df['op'] as the window for a rolling average:
df['SMA'] = df.a.rolling(window=df.op).mean()
This produces the error ValueError: window must be an integer 0 or greater
As mentioned, I am working on a large DataFrame, so the above is only example code.
Solution
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,20,size=(20, 4)),
columns=list('abcd'))
df['op'] = (np.random.randint(0,20, size=20))
def lookback_window(row, values, lookback, method='mean', *args, **kwargs):
    # Positional location of the current row and its own lookback length.
    loc = values.index.get_loc(row.name)
    lb = lookback.loc[row.name]
    # Apply `method` to the current value plus the `lb` values before it.
    return getattr(values.iloc[loc - lb: loc + 1], method)(*args, **kwargs)

df['SMA'] = df.apply(lookback_window, values=df['a'], lookback=df['op'], axis=1)
df
a b c d op SMA
0 17 19 11 9 0 17.000000
1 0 10 9 11 19 NaN
2 13 8 11 2 16 NaN
3 9 2 4 4 8 NaN
4 11 10 0 17 18 NaN
5 14 19 17 10 17 NaN
6 6 12 17 1 4 10.600000
7 10 1 3 18 2 10.000000
8 7 6 12 3 19 NaN
9 1 9 7 5 9 8.800000
10 17 1 3 13 1 9.000000
11 19 17 0 2 7 10.625000
12 18 5 2 4 12 10.923077
13 18 5 4 2 1 18.000000
14 5 11 17 11 11 11.250000
15 16 9 2 11 16 NaN
16 15 17 1 8 14 11.933333
17 15 2 0 3 6 15.142857
18 18 3 18 3 10 13.545455
19 7 0 12 15 3 13.750000
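Since the question mentions a large DataFrame, a row-wise apply may be slow. For the mean specifically, a cumulative sum gives the same kind of per-row lookback without a Python-level loop. The following is a hedged sketch, not the method from the answer above: `variable_lookback_mean` is a hypothetical helper name, and it assumes, as the output above suggests, that a row should be NaN whenever its lookback reaches before the first row.

```python
import numpy as np
import pandas as pd

def variable_lookback_mean(values: pd.Series, lookback: pd.Series) -> pd.Series:
    """Mean of the current value and the `lookback` values before it.

    Hypothetical helper: rows whose lookback reaches before the first
    row are set to NaN (an assumption about the desired behaviour).
    """
    v = values.to_numpy(dtype=float)
    lb = lookback.to_numpy()
    pos = np.arange(len(v))
    start = pos - lb                      # first position inside each window
    valid = start >= 0
    # Prefix sums with a leading 0 so that sum(v[s:e]) == cs[e] - cs[s].
    cs = np.concatenate(([0.0], np.cumsum(v)))
    safe_start = np.where(valid, start, 0)
    means = (cs[pos + 1] - cs[safe_start]) / (lb + 1)
    return pd.Series(np.where(valid, means, np.nan), index=values.index)

demo = pd.DataFrame({"a": [1, 2, 3, 4], "op": [0, 1, 3, 2]})
demo["SMA"] = variable_lookback_mean(demo["a"], demo["op"])
print(demo)
```

The trick only covers reductions that can be expressed through prefix sums (sum, mean); other methods would still need the apply-based helper.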
I have a pandas DataFrame like this (df1):
0 1 2 3 4 5
0 a b c d e f
1 1 4 7 10 13 16
2 2 5 8 11 14 17
3 3 6 9 12 15 18
and I want to generate a DataFrame like this (df2):
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
Additional information about the given df:
The shape of the given df is unknown: b = df1.shape -> b = (n, m)
The width of df1 is guaranteed to be divisible by 3.
I tried stack, melt and wide_to_long. With stack the order of the rows is lost; the rows should be ordered as shown in the example df2. I would really appreciate any help.
Kind regards, Hans
Use np.vstack and np.hsplit:
>>> pd.DataFrame(np.vstack(np.hsplit(df.to_numpy(), df.shape[1] // 3)))
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
Another example:
>>> df
0 1 2 3 4 5 6 7 8
0 a b c d e f g h i
1 1 4 7 10 13 16 19 22 25
2 2 5 8 11 14 17 20 23 26
3 3 6 9 12 15 18 21 24 27
>>> pd.DataFrame(np.vstack(np.hsplit(df.to_numpy(), df.shape[1] // 3)))
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
8 g h i
9 19 22 25
10 20 23 26
11 21 24 27
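The same stacking can also be written as a pure reshape, which may be easier to reason about when the block width changes. This is a sketch under the question's assumption that the number of columns is divisible by 3; `width`, `blocks` and the rebuilt `df1` are illustrative names, not part of the answer above.

```python
import numpy as np
import pandas as pd

# df1 rebuilt from the question.
df1 = pd.DataFrame([list("abcdef"),
                    [1, 4, 7, 10, 13, 16],
                    [2, 5, 8, 11, 14, 17],
                    [3, 6, 9, 12, 15, 18]])

width = 3                                   # columns per block (assumed to divide df1.shape[1])
arr = df1.to_numpy()
n_rows, n_cols = arr.shape

# (rows, blocks, width) -> (blocks, rows, width): each block keeps its row order.
blocks = arr.reshape(n_rows, n_cols // width, width).swapaxes(0, 1)

# Flatten the first two axes so the blocks end up stacked vertically.
df2 = pd.DataFrame(blocks.reshape(-1, width))
print(df2)
```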
You can split the columns and stack the pieces with pd.concat (DataFrame.append was removed in pandas 2.0):
# Works when df has exactly two blocks of 3 columns.
a = df[df.columns[:3]]
b = df[df.columns[3:]]
b.columns = a.columns
df_out = pd.concat([a, b]).reset_index(drop=True)
print(df_out)
Prints:
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
EDIT: To handle unknown widths:
dfs = []
for i in range(0, len(df.columns), 3):
    dfs.append(df[df.columns[i : i + 3]])
    dfs[-1].columns = df.columns[:3]
df_out = pd.concat(dfs)
print(df_out)
Prints:
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
0 d e f
1 10 13 16
2 11 14 17
3 12 15 18
0 g h i
1 19 22 25
2 20 23 26
3 21 24 27
I have a DataFrame which holds two columns like below:
player_id days
0 None 1
1 None 1
2 None 1
3 None 1
4 None 1
5 None 1
6 None 2
7 None 2
8 None 2
9 None 2
10 None 2
.
.
82 None 13
83 None 14
83 None 14
83 None 14
83 None 14
83 None 14
83 None 14
In the output, I need to replace None with the player ids, which run from 1 to 11, to get something like:
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 2
7 8 2
8 9 2
9 10 2
10 11 2
11 1 2
12 2 2
13 3 2
14 4 2
.
.
82 5 13
83 6 14
83 7 14
83 8 14
83 9 14
83 10 14
83 11 14
This is my code:
for index in range(len(df)):
    for i in range(1, 11):
        df.iloc[index, 0] = i
print(df)
However, I get the following dataframe:
player_id days
0 11 1
1 11 1
2 11 1
3 11 1
4 11 1
5 11 1
6 11 2
7 11 2
8 11 2
9 11 2
10 11 2
11 11 2
12 11 2
13 11 2
14 11 2
.
.
82 11 13
83 11 14
83 11 14
83 11 14
83 11 14
83 11 14
83 11 14
I also tried to add a new Series as follows, but it does not work:
for index in range(len(df)):
    for i in range(1, 11):
        df.iloc[index, 0] = pd.Series([i, df['day']], index=['player_id', 'day'])
print(df)
I am not sure whether editing a field in a DataFrame like this is possible. I avoided itertuples and iterrows so that I could edit these rows efficiently.
Try the % operator:
import numpy as np
df['player_id'] = 1 + np.arange(len(df))%11
df
Output:
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 2
7 8 2
8 9 2
9 10 2
10 11 2
82 1 13
83 2 14
83 3 14
83 4 14
83 5 14
83 6 14
83 7 14
Edit: using the index
If the df's index (the first column in the output above) is not sequential and you want the same pattern but based on the index labels, then you can do:
df['player_id'] = 1 + df.index%11
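The two variants differ as soon as the index has gaps or repeated labels (the sample in the question repeats label 83). A small sketch, using a made-up five-row frame, shows the difference; `by_position` and `by_label` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up frame with a gap (2-4 missing) and a repeated label (5).
df = pd.DataFrame({"days": [1, 1, 1, 2, 2]}, index=[0, 1, 5, 5, 5])

# Position-based: cycles 1..11 in row order, whatever the labels are.
by_position = 1 + np.arange(len(df)) % 11

# Label-based: rows sharing an index label get the same player id.
by_label = 1 + df.index.to_numpy() % 11

print(by_position)  # [1 2 3 4 5]
print(by_label)     # [1 2 6 6 6]
```

If every row should get the next id in the cycle, the position-based form is likely the one you want.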
This can be done with an explicit loop:
i = 0
for index in range(len(df)):
    # Cycle through 1..11 and start again at 1.
    df.iloc[index, 0] = 1 + i % 11
    i += 1
print(df)
player_id days
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
5 6 1
6 7 1
7 8 1
8 9 1
9 10 1
10 11 1
11 1 2
12 2 2
13 3 2
14 4 2
15 5 2
16 6 2
17 7 2
18 8 2
19 9 2
20 10 2
21 11 2
22 1 3
23 2 3
24 3 3
25 4 3
26 5 3
27 6 3
28 7 3
29 8 3
30 9 3
31 10 3
32 11 3
I'm making a recommender system, and I'd like to have a matrix of ratings (User/Item). My problem is there are only 9066 unique items in the dataset, but their IDs range from 1 to 165201. So I need a way to map the IDs to be in the range of 1 to 9066, instead of 1 to 165201. How do I do that?
Consider the dataframe df
import numpy as np
import pandas as pd

np.random.seed([3,1415])
df = pd.DataFrame(dict(
    User=np.random.randint(10, size=20),
    Item=np.random.randint(100, size=20)
))
print(df)
Item User
0 27 0
1 77 2
2 54 7
3 39 3
4 23 8
5 84 7
6 37 0
7 99 6
8 87 8
9 37 6
10 63 0
11 25 2
12 11 0
13 71 4
14 44 9
15 70 7
16 4 3
17 71 2
18 63 4
19 86 3
Use unique to get the unique values and build a mapping dictionary:
u = df.Item.unique()
m = dict(zip(u, range(len(u))))
Then use map to produce the reconfigured column:
df.assign(Item=df.Item.map(m))
Item User
0 0 0
1 1 2
2 2 7
3 3 3
4 4 8
5 5 7
6 6 0
7 7 6
8 8 8
9 6 6
10 9 0
11 10 2
12 11 0
13 12 4
14 13 9
15 14 7
16 15 3
17 12 2
18 9 4
19 16 3
Or we could have accomplished the same thing with pd.factorize
df.assign(Item=pd.factorize(df.Item)[0])
Item User
0 0 0
1 1 2
2 2 7
3 3 3
4 4 8
5 5 7
6 6 0
7 7 6
8 8 8
9 6 6
10 9 0
11 10 2
12 11 0
13 12 4
14 13 9
15 14 7
16 15 3
17 12 2
18 9 4
19 16 3
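Both approaches above produce 0-based codes, while the question asks for ids starting at 1. A minimal sketch of the adjustment, which also keeps the array needed to translate back to the original ids, could look like this; the `items` series and the names `item_idx` and `original` are made up for illustration.

```python
import pandas as pd

# Made-up sparse item ids, including the largest id from the question.
items = pd.Series([27, 77, 54, 27, 165201, 77], name="Item")

# codes are 0-based positions into `uniques`, in order of first appearance.
codes, uniques = pd.factorize(items)

# Shift to 1-based ids, as requested in the question.
item_idx = codes + 1

# `uniques` maps a compact code back to the original id.
original = uniques[codes]

print(item_idx)        # [1 2 3 1 4 2]
print(list(original))  # [27, 77, 54, 27, 165201, 77]
```

For a ratings matrix the 0-based codes are usually more convenient as column positions, so the shift is only needed if the ids must literally start at 1.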
I would go through and find the item with the smallest id in the list, set it to 1, then find the next smallest, set it to 2, and so on.
Edit: You are right, that would take way too long. I would just go through and set one of them to 1, the next one to 2, and so on. It doesn't matter what order the ids are in (I am guessing). When a new item is added, just set it to 9067, and so on.
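The incremental scheme described here can be sketched with a plain dictionary that hands out the next free id the first time an item is seen. `assign_id` and `mapping` are hypothetical names, and the item ids are made up.

```python
def assign_id(mapping: dict, item) -> int:
    """Return the compact id for `item`, creating the next one if unseen."""
    # len(mapping) + 1 is evaluated before insertion, so ids run 1, 2, 3, ...
    return mapping.setdefault(item, len(mapping) + 1)

mapping = {}
compact = [assign_id(mapping, item) for item in [165201, 7, 165201, 42]]
print(compact)   # [1, 2, 1, 3]

# A new item added later simply receives the next id.
print(assign_id(mapping, 9999))   # 4
```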
Is it possible to remove duplicates but keep the last 3-4? Something like:
df = df.drop_duplicates(['ID'], keep='last_four')
Thank you
You can use groupby and tail, passing the number of rows you wish to keep, to achieve this:
In [5]:
# data setup
df = pd.DataFrame({'ID':[0,0,0,0,0,0,1,1,1,1,1,1,1,2,2,3,3,3,3,3,3,3,3,3,4], 'val':np.arange(25)})
df
Out[5]:
ID val
0 0 0
1 0 1
2 0 2
3 0 3
4 0 4
5 0 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
11 1 11
12 1 12
13 2 13
14 2 14
15 3 15
16 3 16
17 3 17
18 3 18
19 3 19
20 3 20
21 3 21
22 3 22
23 3 23
24 4 24
Now groupby and call tail:
In [11]:
df.groupby('ID',as_index=False).tail(4)
Out[11]:
ID val
2 0 2
3 0 3
4 0 4
5 0 5
9 1 9
10 1 10
11 1 11
12 1 12
13 2 13
14 2 14
20 3 20
21 3 21
22 3 22
23 3 23
24 4 24
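If a boolean mask is more convenient (for example to combine with other conditions), the same selection can be sketched with cumcount counted from the end of each group. This is an alternative sketch rather than the answer's method; `from_end` and `last_four` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same data setup as above.
df = pd.DataFrame({'ID': [0,0,0,0,0,0,1,1,1,1,1,1,1,2,2,3,3,3,3,3,3,3,3,3,4],
                   'val': np.arange(25)})

# Distance of each row from the end of its group: the last row gets 0.
from_end = df.groupby('ID').cumcount(ascending=False)

# Keep rows that are among the last four of their group.
last_four = df[from_end < 4]
print(last_four)
```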