I have a dataframe that looks like this:
userId movie1 movie2 movie3 movie4 score
0 4.1 2.1 1.0 NaN 2
1 3.1 1.1 3.4 1.4 1
2 2.8 NaN 1.7 NaN 3
3 NaN 5.0 NaN 2.3 4
4 NaN NaN NaN NaN 1
5 2.3 NaN 2.0 4.0 1
I want to subtract the score from each movie column, so the output would look like this:
userId movie1 movie2 movie3 movie4 score
0 2.1 0.1 -1.0 NaN 2
1 2.1 0.1 2.4 0.4 1
2 -0.2 NaN -2.3 NaN 3
3 NaN 1.0 NaN -1.7 4
4 NaN NaN NaN NaN 1
5 1.3 NaN 1.0 3.0 1
The actual dataframe has thousands of movies, and the movies are referenced by name, so I'm trying to find a solution that works with that.
I should have also mentioned that the movies are not listed in order like ["movie1", "movie2", "movie3"]; they are listed by their titles instead, like ["Star Wars", "Harry Potter", "Lord of the Rings"]. The dataset could change, so I won't know what the last movie in the list is.
Use df.filter to identify the movie columns and then subtract the score array from them:
In [35]: x = df.filter(like='movie', axis=1).columns.tolist()
In [36]: df[x] = df.filter(like='movie', axis=1) - df.score.values[:, None]
In [37]: df
Out[37]:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 1
5 5 1.3 NaN 1.0 3.0 1
EDIT: when the movie column names are arbitrary, select all columns except 'userId' and 'score':
x = df.columns[~df.columns.isin(['userId', 'score'])]
df[x] = df[x] - df.score.values[:, None]
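If the movie ratings happen to be the only float columns, a select_dtypes-based variant also works (a sketch, assuming 'userId' and 'score' hold integer dtypes):
# Select every float column; under the stated assumption these are the movies.
movie_cols = df.select_dtypes(include='float').columns
df[movie_cols] = df[movie_cols].sub(df['score'], axis=0)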
A possible solution:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['userId'] = [0 , 1 , 2 , 3 , 4 , 5 ]
df['movie1'] = [4.1 , 3.1, 2.8 , np.nan, np.nan, 2.3 ]
df['movie2'] = [2.1 , 1.1, np.nan, 5.0 , np.nan, np.nan]
df['movie3'] = [1.0 , 3.4, 1.7 , np.nan, np.nan, 2.0 ]
df['movie4'] = [np.nan, 1.4, np.nan, 2.3 , np.nan, 4.0 ]
df['score'] = [2, 1, 3, 4, 5, 6]
print('before = ', df)
# Subtract the score (last column) from every movie column (all but the first and last).
df.iloc[:, 1:-1] = df.iloc[:, 1:-1].sub(df.iloc[:, -1], axis=0)
print('after = ', df)
It should return
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
You can use NumPy broadcasting to subtract here.
v = df.loc[:, 'movie1':'movie4'].to_numpy()
s = df['score'].to_numpy()
out = v - s[:, None]
df.loc[:, 'movie1':'movie4'] = out
df
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 1
5 5 1.3 NaN 1.0 3.0 1
If you don't know the column names, use pd.Index.difference here.
cols = df.columns.difference(['userId', 'score'])
# Every column name is extracted except for 'userId' and 'score'
cols
# Index(['movie1', 'movie2', 'movie3', 'movie4'], dtype='object')
Now, just replace 'movie1':'movie4' with cols.
v = df.loc[:, cols].to_numpy()
s = df['score'].to_numpy()
out = v - s[:, None]
df.loc[:, cols] = out
Solution without using .apply():
df.iloc[:, 1:5] = (
df.iloc[:, 1:5]
- df['score'].values.reshape(-1, 1)
)
You can select the columns with iloc if the names of the columns are unknown, and use the sub method from pandas to avoid converting to NumPy or using apply. I'm assuming the value at [2, 'movie3'] is a typo in your expected output.
df.iloc[:,1:-1] = df.iloc[:,1:-1].sub(df.score, axis=0)
df
Out:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 1
5 5 1.3 NaN 1.0 3.0 1
df.loc[:, "movie1":"movie4"] = df.loc[:, "movie1":"movie4"].apply(
lambda x: x - df["score"]
)
print(df)
Prints:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 1
5 5 1.3 NaN 1.0 3.0 1
How do I write Python code that calculates values for a new column, "UEC_saving", by offsetting a number of rows in the "UEC" column? The table is a pandas DataFrame.
Positive numbers in the "Offset_rows" column mean shift rows down, and vice versa.
For example, "UEC_saving" for index 0 is 8.6 - 7.2, and for index 2 it is 0.2 - 7.2.
The output for "UEC_saving" should look like this:
Product_Class UEC Offset_rows UEC_saving
0 PC1 8.6 1 1.4
1 PC1 7.2 0 0.0
2 PC1 0.2 -1 -7.0
3 PC2 18.8 2 2.2
4 PC2 10.0 1 1.4
5 PC2 8.6 0 0.0
6 PC2 0.3 -1 -8.3
You can do:
for i, row in df.iterrows():
    df.at[i, 'UEC_saving'] = row['UEC'] - df.loc[i + row['Offset_rows'], 'UEC']
df
UEC Offset_rows UEC_saving
0 8.6 1 1.4
1 7.2 0 0.0
2 0.2 -1 -7.0
3 18.8 2 10.2
4 10.0 1 1.4
5 8.6 0 0.0
6 0.3 -1 -8.3
This lines up with all your desired answers except for index 3 (18.8 with offset 2 points at 8.6, giving 10.2). Please let me know if there was a typo, or explain further in your question exactly what happens there.
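For larger frames, a vectorized sketch avoids iterrows entirely; it assumes a default RangeIndex and that every offset stays in bounds (the data below is reconstructed from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product_Class': ['PC1', 'PC1', 'PC1', 'PC2', 'PC2', 'PC2', 'PC2'],
    'UEC': [8.6, 7.2, 0.2, 18.8, 10.0, 8.6, 0.3],
    'Offset_rows': [1, 0, -1, 2, 1, 0, -1],
})

# Each row's target position is its own position plus its offset.
uec = df['UEC'].to_numpy()
target = np.arange(len(df)) + df['Offset_rows'].to_numpy()
df['UEC_saving'] = uec - uec[target]
print(df)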
I have 2 dataframes:
df1
aa gg pm
1 3.3 0.5
1 0.0 4.7
1 9.3 0.2
2 0.3 0.6
2 14.0 91.0
3 13.0 31.0
4 13.1 64.0
5 1.3 0.5
6 3.3 0.5
7 11.1 3.0
7 11.3 24.0
8 3.2 0.0
8 5.3 0.3
8 3.3 0.3
and df2:
aa gg st
1 3.3 in
2 0.3 in
5 1.3 in
7 11.1 in
8 5.3 in
I would like to merge these two dataframes on cols aa and gg to get results like:
aa gg pm st
1 3.3 0.5 in
1 0.0 4.7
1 9.3 0.2
2 0.3 0.6 in
2 14.0 91.0
3 13.0 31.0
4 13.1 64.0
5 1.3 0.5 in
6 3.3 0.5
7 11.1 3.0 in
7 11.3 24.0
8 3.2 0.0
8 5.3 0.3 in
8 3.3 0.3
I want to map the col st values onto df1 based on cols aa and gg.
Please let me know how to do this.
Merging on raw float columns is unreliable, so you can multiply the float columns by 1000 or 10000, convert to integers, and then use these new columns for the join:
df1['gg_int'] = df1['gg'].mul(1000).astype(int)
df2['gg_int'] = df2['gg'].mul(1000).astype(int)
df = df1.merge(df2.drop('gg', axis=1), on=['aa','gg_int'], how='left')
df = df.drop('gg_int', axis=1)
print (df)
aa gg pm st
0 1 3.3 0.5 in
1 1 0.0 4.7 NaN
2 1 9.3 0.2 NaN
3 2 0.3 0.6 in
4 2 14.0 91.0 NaN
5 3 13.0 31.0 NaN
6 4 13.1 64.0 NaN
7 5 1.3 0.5 in
8 6 3.3 0.5 NaN
9 7 11.1 3.0 in
10 7 11.3 24.0 NaN
11 8 3.2 0.0 NaN
12 8 5.3 0.3 in
13 8 3.3 0.3 NaN
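A variant of the same idea (a sketch, assuming three decimal places are enough to identify a gg value): rounding is more robust than truncation when the two frames' floats are not bitwise identical, and assign keeps the helper key out of the result:
# Round gg to 3 decimals on both sides and merge on the temporary key.
df = (
    df1.assign(gg_key=df1['gg'].round(3))
       .merge(df2.assign(gg_key=df2['gg'].round(3)).drop(columns='gg'),
              on=['aa', 'gg_key'], how='left')
       .drop(columns='gg_key')
)
print(df)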
I created the DataFrame and faced a problem:
r value
0 0.8 2.5058
1 0.9 -1.9320
2 1.0 -2.6097
3 1.2 -1.6840
4 1.4 -0.8906
5 0.8 2.6955
6 0.9 -1.9552
7 1.0 -2.6641
8 1.2 -1.7169
9 1.4 -0.9056
... ... ...
Within each run of r from 0.8 to 1.4, I want to assign every row the value at r = 1.0.
Therefore the desired Dataframe should look like:
r value
0 0.8 -2.6097
1 0.9 -2.6097
2 1.0 -2.6097
3 1.2 -2.6097
4 1.4 -2.6097
5 0.8 -2.6641
6 0.9 -2.6641
7 1.0 -2.6641
8 1.2 -2.6641
9 1.4 -2.6641
... ... ....
My first idea was to create the condition:
np.where(data['r']==1.0, data['value'], 1.0)
but it does not solve my problem.
Try this:
def subr(df):
    isone = df.r == 1.0
    if isone.any():
        atone = df.value[isone].iloc[0]
        # Improvement suggested by @root
        df.loc[df.r.between(0.8, 1.4), 'value'] = atone
        # df.loc[(df.r >= .8) & (df.r <= 1.4), 'value'] = atone
    return df

# A new group starts whenever r resets to a smaller value.
df.groupby((df.r < df.r.shift()).cumsum()).apply(subr)
Starting with this:
r value
0 0.8 2.5058
1 0.9 -1.9320
2 1.0 -2.6097
3 1.2 -1.6840
4 1.4 -0.8906
5 0.8 2.6955
6 0.9 -1.9552
7 1.0 -2.6641
8 1.2 -1.7169
9 1.4 -0.9056
# Start a new group at each 0.8, the first r of every run.
df3['grp'] = (df3['r'] == .8).cumsum()
# Map each group id to its value at r == 1.0.
grpd = dict(df3[['grp', 'value']][df3['r'] == 1].values)
df3["value"] = df3["grp"].map(grpd)
df3 = df3.drop('grp', axis=1)
r value
0 0.8 -2.6097
1 0.9 -2.6097
2 1.0 -2.6097
3 1.2 -2.6097
4 1.4 -2.6097
5 0.8 -2.6641
6 0.9 -2.6641
7 1.0 -2.6641
8 1.2 -2.6641
9 1.4 -2.6641
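A compact alternative sketch using the same grouping idea with groupby().transform; it relies on 'first' skipping NaN, so each group picks up its value at r == 1.0:
grp = (df['r'] < df['r'].shift()).cumsum()    # new group each time r resets
at_one = df['value'].where(df['r'] == 1.0)    # keep only the r == 1.0 values
df['value'] = at_one.groupby(grp).transform('first')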
Given the following data frame:
index value
1 0.8
2 0.9
3 1.0
4 0.9
5 nan
6 nan
7 nan
8 0.4
9 0.9
10 nan
11 0.8
12 2.0
13 1.4
14 1.9
15 nan
16 nan
17 nan
18 8.4
19 9.9
20 10.0
…
in which the 'value' data is separated into a number of clusters by NaN values. Is there any way I can calculate statistics such as the accumulated sum or the mean of the clustered data? For example, I want to calculate the accumulated sum and generate the following data frame:
index value cumsum
1 0.8 0.8
2 0.9 1.7
3 1.0 2.7
4 0.9 3.6
5 nan 0
6 nan 0
7 nan 0
8 0.4 0.4
9 0.9 1.3
10 nan 0
11 0.8 0.8
12 2.0 2.8
13 1.4 4.2
14 1.9 6.1
15 nan 0
16 nan 0
17 nan 0
18 8.4 8.4
19 9.9 18.3
20 10.0 28.3
…
Any suggestions?
Also, as a simple extension of the problem: if two clusters of data are close enough, say only one NaN separates them, we consider them as one cluster, so that we get the following data frame:
index value cumsum
1 0.8 0.8
2 0.9 1.7
3 1.0 2.7
4 0.9 3.6
5 nan 0
6 nan 0
7 nan 0
8 0.4 0.4
9 0.9 1.3
10 nan 1.3
11 0.8 2.1
12 2.0 4.1
13 1.4 5.5
14 1.9 7.4
15 nan 0
16 nan 0
17 nan 0
18 8.4 8.4
19 9.9 18.3
20 10.0 28.3
Thank you for the help!
You can do the first part using the compare-cumsum-groupby pattern. Your "simple extension" isn't quite so simple, but we can still pull it off, by finding out the parts of value that we want to treat as zero:
n = df["value"].isnull()
clusters = (n != n.shift()).cumsum()
df["cumsum"] = df["value"].groupby(clusters).cumsum().fillna(0)
to_zero = n & (df["value"].groupby(clusters).transform('size') == 1)
tmp_value = df["value"].where(~to_zero, 0)
n2 = tmp_value.isnull()
new_clusters = (n2 != n2.shift()).cumsum()
df["cumsum_skip1"] = tmp_value.groupby(new_clusters).cumsum().fillna(0)
produces
>>> df
index value cumsum cumsum_skip1
0 1 0.8 0.8 0.8
1 2 0.9 1.7 1.7
2 3 1.0 2.7 2.7
3 4 0.9 3.6 3.6
4 5 NaN 0.0 0.0
5 6 NaN 0.0 0.0
6 7 NaN 0.0 0.0
7 8 0.4 0.4 0.4
8 9 0.9 1.3 1.3
9 10 NaN 0.0 1.3
10 11 0.8 0.8 2.1
11 12 2.0 2.8 4.1
12 13 1.4 4.2 5.5
13 14 1.9 6.1 7.4
14 15 NaN 0.0 0.0
15 16 NaN 0.0 0.0
16 17 NaN 0.0 0.0
17 18 8.4 8.4 8.4
18 19 9.9 18.3 18.3
19 20 10.0 28.3 28.3