I have a dataframe that looks like this:
userId movie1 movie2 movie3 movie4 score
0 4.1 2.1 1.0 NaN 2
1 3.1 1.1 3.4 1.4 1
2 2.8 NaN 1.7 NaN 3
3 NaN 5.0 NaN 2.3 4
4 NaN NaN NaN NaN 1
5 2.3 NaN 2.0 4.0 1
I want to subtract the score from each movie column, so the output would look like this:
userId movie1 movie2 movie3 movie4 score
0 2.1 0.1 -1.0 NaN 2
1 2.1 0.1 2.4 0.4 1
2 -0.2 NaN -2.3 NaN 3
3 NaN 1.0 NaN -1.7 4
4 NaN NaN NaN NaN 1
5 1.3 NaN 1.0 3.0 1
The actual dataframe has thousands of movies, and the movies are referenced by name, so I'm trying to find a solution that works in that case.
I should have also mentioned that the movies are not listed in order like ["movie1", "movie2", "movie3"]; they are listed by their titles instead, like ["Star Wars", "Harry Potter", "Lord of the Rings"]. The dataset could change, so I won't know what the last movie in the list is.
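For reference, a minimal reconstruction of the sample dataframe above (the real column names would be movie titles; the values are the ones shown in the question):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'userId': [0, 1, 2, 3, 4, 5],
    'movie1': [4.1, 3.1, 2.8, np.nan, np.nan, 2.3],
    'movie2': [2.1, 1.1, np.nan, 5.0, np.nan, np.nan],
    'movie3': [1.0, 3.4, 1.7, np.nan, np.nan, 2.0],
    'movie4': [np.nan, 1.4, np.nan, 2.3, np.nan, 4.0],
    'score': [2, 1, 3, 4, 1, 1],
})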
Use df.filter to identify the movie columns and then subtract the score array from those columns:
In [35]: x = df.filter(like='movie', axis=1).columns.tolist()
In [36]: df[x] = df.filter(like='movie', axis=1) - df.score.values[:, None]
In [37]: df
Out[37]:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
EDIT: when the movie column names are arbitrary, select all columns except 'userId' and 'score':
x = df.columns[~df.columns.isin(['userId', 'score'])]
df[x] = df[x] - df.score.values[:, None]
A possible solution
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['userId'] = [0 , 1 , 2 , 3 , 4 , 5 ]
df['movie1'] = [4.1 , 3.1, 2.8 , np.nan, np.nan, 2.3 ]
df['movie2'] = [2.1 , 1.1, np.nan, 5.0 , np.nan, np.nan]
df['movie3'] = [1.0 , 3.4, 1.7 , np.nan, np.nan, 2.0 ]
df['movie4'] = [np.nan, 1.4, np.nan, 2.3 , np.nan, 4.0 ]
df['score'] = [2, 1, 3, 4, 5, 6]
print('before = ', df)
df.iloc[:,1:-1] = df.iloc[:,1:-1].sub(df.iloc[:,-1].values, axis='rows')
print('after = ', df)
It should return
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
You can use NumPy broadcasting to subtract here.
v = df.loc[:, 'movie1':'movie4'].to_numpy()
s = df['score'].to_numpy()
out = v - s[:, None]
df.loc[:, 'movie1':'movie4'] = out
df
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
If you don't know the column names, use pd.Index.difference here.
cols = df.columns.difference(['userId', 'score'])
# Every column name is selected except 'userId' and 'score'
cols
# Index(['movie1', 'movie2', 'movie3', 'movie4'], dtype='object')
Now, just replace 'movie1':'movie4' with cols.
v = df.loc[:, cols].to_numpy()
s = df['score'].to_numpy()
out = v - s[:, None]
df.loc[:, cols] = out
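Note that Index.difference returns the remaining column names sorted alphabetically. If the original column order matters, a boolean mask (as in the earlier answer) preserves it; a minimal sketch:
# keeps the dataframe's original column order, unlike Index.difference
cols = df.columns[~df.columns.isin(['userId', 'score'])]
df.loc[:, cols] = df.loc[:, cols].to_numpy() - df['score'].to_numpy()[:, None]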
Solution without using .apply():
df.iloc[:, 1:5] = (
    df.iloc[:, 1:5]
    - df['score'].values.reshape(-1, 1)
)
You can select the columns with iloc if the names of the columns are unknown, and use the pandas sub function to avoid converting to NumPy or using apply. I'm assuming the value at [2, 'movie3'] is a typo in your expected output.
df.iloc[:,1:-1] = df.iloc[:,1:-1].sub(df.score, axis=0)
df
Out:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 1
5 5 1.3 NaN 1.0 3.0 1
df.loc[:, "movie1":"movie4"] = df.loc[:, "movie1":"movie4"].apply(
lambda x: x - df["score"]
)
print(df)
Prints:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
Related
So far, I've only seen questions about how to ignore NaNs while doing a rolling mean on a groupby. My case is the opposite: I want to include the NaNs, such that if even one of the values in the rolling window is NaN, the resulting rolling mean should be NaN as well.
Input:
grouping value_to_avg
0 1 1.0
1 1 2.0
2 1 3.0
3 1 NaN
4 1 4.0
5 2 5.0
6 2 NaN
7 2 6.0
8 2 7.0
9 2 8.0
Code to create sample input:
data = {'grouping': [1,1,1,1,1,2,2,2,2,2], 'value_to_avg': [1,2,3,np.nan,4,5,np.nan,6,7,8]}
db = pd.DataFrame(data)
Code that I have tried:
db['rolling_mean_actual'] = db.groupby('grouping')['value_to_avg'].transform(lambda s: s.rolling(window=3, center=True, min_periods=1).mean(skipna=False))
Actual vs. expected output:
grouping value_to_avg rolling_mean_actual rolling_mean_expected
0 1 1.0 1.5 1.5
1 1 2.0 2.0 2.0
2 1 3.0 2.5 NaN
3 1 NaN 3.5 NaN
4 1 4.0 4.0 NaN
5 2 5.0 5.0 NaN
6 2 NaN 5.5 NaN
7 2 6.0 6.5 NaN
8 2 7.0 7.0 7.0
9 2 8.0 7.5 7.5
You can see above that using skipna=False inside the mean function does not work as expected and still ignores NaNs.
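The underlying reason is that pandas' mean always skips NaN, while np.mean over a plain NumPy array propagates it; a quick check, which the np.mean-based answers below rely on:
import numpy as np
import pandas as pd
s = pd.Series([1.0, np.nan, 3.0])
print(s.mean())               # 2.0 -- pandas skips the NaN
print(np.mean(s.to_numpy()))  # nan -- NumPy propagates it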
A custom function that calls np.mean on the values converted to a NumPy array works for me:
roll_window = 3
db['rolling_mean_actual'] = (db.groupby('grouping')['value_to_avg']
                               .transform(lambda s: s.rolling(roll_window,
                                                              center=True,
                                                              min_periods=1)
                                                     .apply(lambda x: np.mean(x.to_numpy()))))
You can also avoid transform:
roll_window = 3
db['rolling_mean_actual'] = (db.groupby('grouping')['value_to_avg']
.rolling(roll_window, center=True, min_periods=1)
.apply(lambda x: np.mean(x.to_numpy()))
.droplevel(0))
print (db)
grouping value_to_avg rolling_mean_actual
0 1 1.0 1.5
1 1 2.0 2.0
2 1 3.0 NaN
3 1 NaN NaN
4 1 4.0 NaN
5 2 5.0 NaN
6 2 NaN NaN
7 2 6.0 NaN
8 2 7.0 7.0
9 2 8.0 7.5
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"grouping": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
"value_to_avg": [1, 2, 3, np.nan, 4, 5, np.nan, 6, 7, 8],
}
)
pd.concat(
    [
        df,
        df.groupby("grouping", as_index=False)
          .rolling(window=3, center=True, min_periods=0)
          .apply(lambda x: x.mean() if ~x.isna().any() else np.nan)
          .rename(columns={'value_to_avg': 'rolling avg'}),
    ],
    axis=1,
).iloc[:, [0, 1, 3]]
>>>
grouping value_to_avg rolling avg
0 1 1.0 1.5
1 1 2.0 2.0
2 1 3.0 NaN
3 1 NaN NaN
4 1 4.0 NaN
5 2 5.0 NaN
6 2 NaN NaN
7 2 6.0 NaN
8 2 7.0 7.0
9 2 8.0 7.5
I want to generate a column count that counts the pts values grouped by id. The condition is that if x and y are both NaN, the corresponding pts is counted; otherwise it is ignored.
Sample Df:
id pts x y
0 1 0.1 NaN NaN
1 1 0.2 1.0 NaN
2 1 1.1 NaN NaN
3 2 0.1 NaN NaN
4 2 0.2 2.0 1.0
5 3 1.1 NaN NaN
6 3 1.2 NaN 5.0
7 3 3.1 NaN NaN
8 3 3.2 NaN NaN
9 4 0.1 NaN NaN
Expected df:
id pts x y count
0 1 0.1 NaN NaN 2
1 1 0.2 1.0 NaN 2
2 1 1.1 NaN NaN 2
3 2 0.1 NaN NaN 1
4 2 0.2 2.0 1.0 1
5 3 1.1 NaN NaN 3
6 3 1.2 NaN 5.0 3
7 3 3.1 NaN NaN 3
8 3 3.2 NaN NaN 3
9 4 0.1 NaN NaN 1
I tried:
df['count'] = df.groupby(['id'])['pts'].value_counts()
You can test whether the values are missing in both columns with DataFrame.isna and DataFrame.all, then count the True values with sum per group via GroupBy.transform for the new column:
df['count'] = df[['x','y']].isna().all(axis=1).groupby(df['id']).transform('sum')
print (df)
id pts x y count
0 1 0.1 NaN NaN 2
1 1 0.2 1.0 NaN 2
2 1 1.1 NaN NaN 2
3 2 0.1 NaN NaN 1
4 2 0.2 2.0 1.0 1
5 3 1.1 NaN NaN 3
6 3 1.2 NaN 5.0 3
7 3 3.1 NaN NaN 3
8 3 3.2 NaN NaN 3
9 4 0.1 NaN NaN 1
Or chain both masks by & for bitwise AND:
df['count'] = (df['x'].isna() & df['y'].isna()).groupby(df['id']).transform('sum')
print (df)
id pts x y count
0 1 0.1 NaN NaN 2
1 1 0.2 1.0 NaN 2
2 1 1.1 NaN NaN 2
3 2 0.1 NaN NaN 1
4 2 0.2 2.0 1.0 1
5 3 1.1 NaN NaN 3
6 3 1.2 NaN 5.0 3
7 3 3.1 NaN NaN 3
8 3 3.2 NaN NaN 3
9 4 0.1 NaN NaN 1
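An equivalent sketch, if you prefer computing the per-id counts once and mapping them back rather than using transform (same mask as above):
# count rows per id where both x and y are NaN, then map the counts back
both_nan = df['x'].isna() & df['y'].isna()
counts = both_nan.groupby(df['id']).sum()
df['count'] = df['id'].map(counts)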
I am new to pandas in Python. I have a dataframe with 2 keys, 15 rows each, and 1 column, like below:
1
key1/1 0.5
key1/2 0.5
key1/3 0
key1/4 0
key1/5 0.6
key1/6 0.7
key1/7 0
key1/8 0
key1/9 0
key1/10 0.5
key1/11 0.5
key1/12 0.5
key1/13 0
key1/14 0.5
key1/15 0.5
key2/1 0.4
key2/2 0.2
key2/3 0
key2/4 0
key2/5 0.1
key2/6 0.2
key2/7 0
key2/8 0
key2/9 0.3
key2/10 0.2
key2/11 0
key2/12 0.5
key2/13 0
key2/14 0
key2/15 0.5
I want to iterate over the rows of the dataframe so that each time it meets a zero it creates a new column, like below:
1 2 3 4
key1/1 0.5 0.6 0.5 0.5
key1/2 0.5 0.7 0.5 0.5
key1/3 nan nan 0.5 nan
key1/4 nan nan nan nan
1 2 3 4 5
key2/1 0.4 0.1 0.3 0.5 0.5
key2/2 0.2 0.2 0.2 nan nan
key2/3 nan nan nan nan nan
key2/4 nan nan nan nan nan
I have tried the following code, trying to iterate over 'key1' only:
df2=pd.Dataframe[]
for row in df['key1'].index:
    new_df['keyl'][row] == df['keyl'][row]
    if df['keyl'][row] == 0:
        new_df['key1'].append(df2,ignore_index=True)
Obviously it is not working, please send some help. Ideally I would like to modify the same dataframe instead of creating a new one. Thanks
You can mask the zeros and assign a group key. Based on that key you can group the values and transform them into columns.
All credit goes to this answer. You will find a great explanation there.
df2 = df.mask((df['1'] == 0) )
df2['group'] = (df2['1'].shift(1).isnull() & df2['1'].notnull()).cumsum()
df2 = df2.dropna()
df2.pivot(columns='group')
1
group 1 2 3 4
key1/1 0.5 NaN NaN NaN
key1/10 NaN NaN 0.5 NaN
key1/11 NaN NaN 0.5 NaN
key1/12 NaN NaN 0.5 NaN
key1/14 NaN NaN NaN 0.5
key1/15 NaN NaN NaN 0.5
key1/2 0.5 NaN NaN NaN
key1/5 NaN 0.6 NaN NaN
key1/6 NaN 0.7 NaN NaN
Your group key will look like this:
1 group
key1/1 0.5 1
key1/2 0.5 1
key1/3 NaN 1
key1/4 NaN 1
key1/5 0.6 2
key1/6 0.7 2
key1/7 NaN 2
key1/8 NaN 2
key1/9 NaN 2
key1/10 0.5 3
key1/11 0.5 3
key1/12 0.5 3
key1/13 NaN 3
key1/14 0.5 4
key1/15 0.5 4
You can then translate this data into column format.
Complete solution:
df2 = df.mask((df['1'] == 0) )
df2['group'] = (df2['1'].shift(1).isnull() & df2['1'].notnull()).cumsum()
df2 = df2.dropna()
x = df2.groupby('group')['1'].apply(list)
df3 = pd.DataFrame(x.values.tolist()).T
df3.index = [f"key1/{i}" for i in range(1,len(df3)+1)]
0 1 2 3
key1/1 0.5 0.6 0.5 0.5
key1/2 0.5 0.7 0.5 0.5
key1/3 NaN NaN 0.5 NaN
If you want something in that format, you first need data shaped like this:
group
1 [0.5, 0.5]
2 [0.6, 0.7]
3 [0.5, 0.5, 0.5]
4 [0.5, 0.5]
Name: 1, dtype: object
Update 1: Assuming the index contains just the key names (key1, key2) repeated for their rows, you can wrap the logic in a function and apply it per key:
def func(r):
    df2 = r.mask(r['1'] == 0)
    df2['group'] = (df2['1'].shift(1).isnull() & df2['1'].notnull()).cumsum()
    df2 = df2.dropna()
    x = df2.groupby('group')['1'].apply(list)
    df3 = pd.DataFrame(x.values.tolist()).T
    # df3.index = [r.name]*len(df3)
    return df3
df.groupby(df.index).apply(func)
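If the keys are instead embedded in the index as prefixes (e.g. 'key1/3'), one possible sketch is to group on the extracted prefix rather than on the full index; the split on '/' is an assumption about the label format:
key = df.index.str.split('/').str[0]  # 'key1/3' -> 'key1' (assumed format)
result = df.groupby(key).apply(func)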
If I have a Pandas data frame like this:
1 2 3 4 5 6 7
1 NaN 1 1 1 NaN 1 1
2 NaN NaN 1 1 1 1 1
3 NaN NaN NaN 1 NaN 1 1
4 1 1 NaN NaN 1 1 NaN
How do I do a cumulative sum such that the count resets every time there is a NaN value in the row, so that I get something like this:
1 2 3 4 5 6 7
1 NaN 1 2 3 NaN 1 2
2 NaN NaN 1 2 3 4 5
3 NaN NaN NaN 1 NaN 1 2
4 1 2 NaN NaN 1 2 NaN
You could do:
# compute mask where np.nan = True
mask = pd.isna(df).astype(bool)
# compute cumsum across rows fillna with ffill
cumulative = df.cumsum(1).fillna(method='ffill', axis=1).fillna(0)
# get the values of cumulative where nan is True use the same method
restart = cumulative[mask].fillna(method='ffill', axis=1).fillna(0)
# set the result
result = (cumulative - restart)
result[mask] = np.nan
# display the result
print(result)
Output
1 2 3 4 5 6 7
0 NaN 1.0 2.0 3.0 NaN 1.0 2.0
1 NaN NaN 1.0 2.0 3.0 4.0 5.0
2 NaN NaN NaN 1.0 NaN 1.0 2.0
3 1.0 2.0 NaN NaN 1.0 2.0 NaN
You can do this with stack and unstack:
s=df.stack(dropna=False).isnull().cumsum()
df=df.where(df.isnull(),s.groupby(s).cumcount().unstack())
df
Out[86]:
1 2 3 4 5 6 7
1 NaN 1.0 2.0 3.0 NaN 1 2.0
2 NaN NaN 1.0 2.0 3.0 4 5.0
3 NaN NaN NaN 1.0 NaN 1 2.0
4 3.0 4.0 NaN NaN 1.0 2 NaN
I came up with a slightly different answer here that might be helpful.
For a single series, I made this function to do the cumsum-reset on nulls.
def cumsum_reset_on_null(srs: pd.Series) -> pd.Series:
    """
    For a pandas series with null values,
    do a cumsum and reset the cumulative sum when a null value is encountered.
    Example)
    input:  [1, 1, np.nan, 1, 2, 3, np.nan, 1, np.nan]
    return: [1, 2, np.nan, 1, 3, 6, np.nan, 1, np.nan]
    """
    cumulative = srs.cumsum().fillna(method='ffill')
    restart = ((cumulative * srs.isnull()).replace(0.0, np.nan)
               .fillna(method='ffill').fillna(0))
    result = cumulative - restart
    return result.replace(0, np.nan)
Then for the full dataframe, just apply this function row-wise
df = pd.DataFrame([
[np.nan, 1, 1, 1, np.nan, 1, 1],
[np.nan, np.nan, 1, 1, 1, 1, 1],
[np.nan, np.nan, np.nan, 1, np.nan, 1, 1],
[1, 1, np.nan, np.nan, 1, 1, np.nan],
])
df.apply(cumsum_reset_on_null, axis=1)
0 NaN 1.0 2.0 3.0 NaN 1.0 2.0
1 NaN NaN 1.0 2.0 3.0 4.0 5.0
2 NaN NaN NaN 1.0 NaN 1.0 2.0
3 1.0 2.0 NaN NaN 1.0 2.0 NaN
One of the ways can be:
sample = pd.DataFrame({1:[np.nan,np.nan,np.nan,1],2:[1,np.nan,np.nan,1],3:[1,1,np.nan,np.nan],4:[1,1,1,np.nan],5:[np.nan,1,np.nan,1],6:[1,1,1,1],7:[1,1,1,np.nan]},index=[1,2,3,4])
Output of sample
1 2 3 4 5 6 7
1 NaN 1.0 1.0 1.0 NaN 1 1.0
2 NaN NaN 1.0 1.0 1.0 1 1.0
3 NaN NaN NaN 1.0 NaN 1 1.0
4 1.0 1.0 NaN NaN 1.0 1 NaN
The following code would do it:
# numr = number of rows
# numc = number of columns
numr, numc = sample.shape
for i in range(numr):
    s = 0
    flag = 0
    for j in range(numc):
        if np.isnan(sample.iloc[i, j]):
            flag = 1
        else:
            if flag == 1:
                s = sample.iloc[i, j]
                flag = 0
            else:
                s += sample.iloc[i, j]
            sample.iloc[i, j] = s
Output:
1 2 3 4 5 6 7
1 NaN 1.0 2.0 3.0 NaN 1.0 2.0
2 NaN NaN 1.0 2.0 3.0 4.0 5.0
3 NaN NaN NaN 1.0 NaN 1.0 2.0
4 1.0 2.0 NaN NaN 1.0 2.0 NaN
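Another compact possibility, sketched here using the df constructed in the apply-based answer above: group each row's values by the running count of NaNs seen so far in that row, then take the cumulative sum within each run (NaN cells stay NaN because the grouped cumsum skips them):
out = df.apply(lambda row: row.groupby(row.isna().cumsum()).cumsum(), axis=1)
print(out)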
I have sparse data stored in a dataframe:
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
a b data
0 1 2 -0.824022
1 3 5 0.503239
2 5 5 -0.540105
Since I care about the null data, the actual data would look like this:
true_df
a b data
0 1 1 NaN
1 1 2 -0.824022
2 1 3 NaN
3 1 4 NaN
4 1 5 NaN
5 2 1 NaN
6 2 2 NaN
7 2 3 NaN
8 2 4 NaN
9 2 5 NaN
10 3 1 NaN
11 3 2 NaN
12 3 3 NaN
13 3 4 NaN
14 3 5 0.503239
15 4 1 NaN
16 4 2 NaN
17 4 3 NaN
18 4 4 NaN
19 4 5 NaN
20 5 1 NaN
21 5 2 NaN
22 5 3 NaN
23 5 4 NaN
24 5 5 -0.540105
My question is: how do I construct true_df? I was hoping there was some way to use pd.concat or pd.merge, that is, to construct a dataframe the shape of the dense table and then join the two dataframes, but that doesn't join in the expected way (the columns are not combined). The ultimate goal is to pivot on a and b.
As a follow-up, because I think kinjo is correct: why does this only work for integers and not for floats? Using:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1.0,1.3,1.5], 'b':[1.2,1.5,1.5], 'data':np.random.randn(3)})
### Create all possible combinations of a,b
newindex = [(b,a) for b in np.arange(1,df.b.max()+0.1, 0.1) for a in np.arange(1,df.a.max()+0.1,0.1)]
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex).reset_index()
Will return:
a b data
0 1.0 1.0 NaN
1 1.0 1.1 NaN
2 1.0 1.2 NaN
3 1.0 1.3 NaN
4 1.0 1.4 NaN
5 1.0 1.5 NaN
6 1.0 1.6 NaN
7 1.1 1.0 NaN
8 1.1 1.1 NaN
9 1.1 1.2 NaN
10 1.1 1.3 NaN
11 1.1 1.4 NaN
12 1.1 1.5 NaN
13 1.1 1.6 NaN
14 1.2 1.0 NaN
15 1.2 1.1 NaN
16 1.2 1.2 NaN
17 1.2 1.3 NaN
18 1.2 1.4 NaN
19 1.2 1.5 NaN
20 1.2 1.6 NaN
21 1.3 1.0 NaN
22 1.3 1.1 NaN
23 1.3 1.2 NaN
24 1.3 1.3 NaN
25 1.3 1.4 NaN
26 1.3 1.5 NaN
27 1.3 1.6 NaN
28 1.4 1.0 NaN
29 1.4 1.1 NaN
30 1.4 1.2 NaN
31 1.4 1.3 NaN
32 1.4 1.4 NaN
33 1.4 1.5 NaN
34 1.4 1.6 NaN
35 1.5 1.0 NaN
36 1.5 1.1 NaN
37 1.5 1.2 NaN
38 1.5 1.3 NaN
39 1.5 1.4 NaN
40 1.5 1.5 NaN
41 1.5 1.6 NaN
42 1.6 1.0 NaN
43 1.6 1.1 NaN
44 1.6 1.2 NaN
45 1.6 1.3 NaN
46 1.6 1.4 NaN
47 1.6 1.5 NaN
48 1.6 1.6 NaN
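The short answer to the follow-up: reindex matches labels by exact floating-point equality, and the values produced by np.arange come from repeated float arithmetic, so they generally do not compare equal to the float literals stored in the index. The classic illustration:
print(0.1 * 3)         # 0.30000000000000004
print(0.1 * 3 == 0.3)  # False
Because the generated (a, b) tuples typically fail this exact comparison against the stored labels, the lookups come back NaN everywhere.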
Reindex is a straightforward solution. Similar to @jezrael's solution, but no need for merge.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
### Create all possible combinations of a,b
newindex = [(a, b) for a in range(1, df.a.max() + 1) for b in range(1, df.b.max() + 1)]  # tuple order matches the ['a', 'b'] index
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex)
You can then reset the index if you want the numeric count as your overall index.
In the case that your index is floats, you should use linspace and not arange:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1.0,1.3,1.5], 'b':[1.2,1.5,1.5], 'data':np.random.randn(3)})
### Create all possible combinations of a,b
### a_min/a_max/a_num and b_min/b_max/b_num are placeholders for your grid; linspace's third argument is the number of points, not the step
newindex = [(a, b) for a in np.linspace(a_min, a_max, a_num, endpoint=False) for b in np.linspace(b_min, b_max, b_num, endpoint=False)]
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex).reset_index()
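A concrete sketch with hypothetical grid values chosen to cover the float sample above; rounding makes the generated labels compare equal to the stored ones:
# hypothetical grid: 1.0 to 1.6 in steps of 0.1 (7 points), rounded to 1 decimal
a_vals = np.round(np.linspace(1.0, 1.6, num=7), 1)
b_vals = np.round(np.linspace(1.0, 1.6, num=7), 1)
newindex = [(a, b) for a in a_vals for b in b_vals]
df.set_index(['a', 'b']).reindex(newindex).reset_index()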
Since you intend to pivot on a and b, you could obtain the pivoted result with:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
result = pd.DataFrame(np.nan, index=range(1,6), columns=range(1,6))
result.update(df.pivot(index='a', columns='b', values='data'))
print(result)
which yields
1 2 3 4 5
1 NaN 0.436389 NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN -1.066621
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN 0.328880
This is a nice, fast approach for converting numeric data from sparse to dense, using SciPy's sparse functionality. It works if your ultimate goal is the pivoted (i.e. dense) dataframe:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
df_shape = df['a'].max()+1, df['b'].max()+1
sp_df = csr_matrix((df['data'], (df['a'], df['b'])), shape=df_shape)
df_dense = pd.DataFrame.sparse.from_spmatrix(sp_df)
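A short usage note (a sketch): the sparse frame is filled with 0 rather than NaN, and row/column 0 exist only because the labels start at 1, so you may want to densify and trim them:
# densify and drop the unused 0 row/column; index/columns then run 1..5
dense = df_dense.sparse.to_dense().iloc[1:, 1:]
print(dense)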