I have a df like the one below:

   userId  movieId  rating
0       1       31     2.0
1       2       10     4.0
2       2       17     5.0
3       2       39     5.0
4       2       47     4.0
5       3       31     3.0
6       3       10     2.0
I need to add two columns: one is mean, the mean rating for each movie; the other is diff, the difference between rating and mean.
Please note that movieId can be repeated, because different users may rate the same movie. Here rows 0 and 5 are for movieId 31, and rows 1 and 6 are for movieId 10.
   userId  movieId  rating  mean  diff
0       1       31     2.0   2.5  -0.5
1       2       10     4.0   3.0   1.0
2       2       17     5.0   5.0   0.0
3       2       39     5.0   5.0   0.0
4       2       47     4.0   4.0   0.0
5       3       31     3.0   2.5   0.5
6       3       10     2.0   3.0  -1.0
Here is some of my code, which calculates the mean but collapses the frame to one row per movieId:
df = df.groupby('movieId')['rating'].agg(['count', 'mean']).reset_index()
You can use transform to keep the same number of rows when calculating the mean with groupby; the difference is then straightforward:
df['mean'] = df.groupby('movieId')['rating'].transform('mean')
df['diff'] = df['rating'] - df['mean']
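For reference, a minimal runnable sketch with the sample data from the question:

import pandas as pd

df = pd.DataFrame({
    'userId':  [1, 2, 2, 2, 2, 3, 3],
    'movieId': [31, 10, 17, 39, 47, 31, 10],
    'rating':  [2.0, 4.0, 5.0, 5.0, 4.0, 3.0, 2.0],
})

# transform('mean') broadcasts each movie's mean back onto every original row
df['mean'] = df.groupby('movieId')['rating'].transform('mean')
df['diff'] = df['rating'] - df['mean']
print(df)

Unlike agg, transform preserves the original shape, so the result can be assigned straight back as a column.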
Related
I am fairly new to Python, and I have the following DataFrame:
setting_id subject_id seconds result_id owner_id average duration_id
0 7 1 0 1680.5 2.0 24.000 1.0
1 7 1 3600 1690.5 2.0 46.000 2.0
2 7 1 10800 1700.5 2.0 101.000 4.0
3 7 2 0 1682.5 2.0 12.500 1.0
4 7 2 3600 1692.5 2.0 33.500 2.0
5 7 2 10800 1702.5 2.0 86.500 4.0
6 7 3 0 1684.5 2.0 8.500 1.0
7 7 3 3600 1694.5 2.0 15.000 2.0
8 7 3 10800 1704.5 2.0 34.000 4.0
What I need to do is calculate the deviation (%) of the averages with a seconds value not equal to 0 from the average with a seconds value of zero, where the subject_id and setting_id are the same.
i.e. setting_id == 7 & subject_id == 1 would be:
(result/baseline)*100
------> for 3600 seconds: (46/24)*100 = +192%
------> for 10800 seconds: (101/24)*100 = +421%
.... baseline = the average of the row with a seconds value of 0
.... result = the average of a row with a seconds value other than 0
The resulting df should look like this
setting_id subject_id seconds owner_id average deviation duration_id
0 7 1 0 2 24 0 1
1 7 1 3600 2 46 192 2
2 7 1 10800 2 101 421 4
I want to use these calculations to then plot a regression graph (with seaborn) of the deviations from baseline.
I have played around with this df for 2 days now and tried different for loops, but I just can't figure out the correct way.
You can use:
# identify the baseline rows (seconds == 0)
m = df['seconds'].eq(0)

# per-(setting_id, subject_id) baseline: the sum of 'average' where
# seconds == 0 (one such row per group in the sample data)
s = (df['average'].where(m)
       .groupby([df['setting_id'], df['subject_id']])
       .sum()
    )

# compute the deviation per group
deviation = (
    df[['setting_id', 'subject_id']]
    .merge(s, left_on=['setting_id', 'subject_id'], right_index=True, how='left')
    ['average']                    # the baseline, aligned to the original rows
    .rdiv(df['average']).mul(100)  # 100 * average / baseline
    .round().astype(int)           # optional
    .mask(m, 0)                    # baseline rows get deviation 0
)
df['deviation'] = deviation
# or
# out = df.assign(deviation=deviation)
Output:
setting_id subject_id seconds result_id owner_id average duration_id deviation
0 7 1 0 1680.5 2.0 24.0 1.0 0
1 7 1 3600 1690.5 2.0 46.0 2.0 192
2 7 1 10800 1700.5 2.0 101.0 4.0 421
3 7 2 0 1682.5 2.0 12.5 1.0 0
4 7 2 3600 1692.5 2.0 33.5 2.0 268
5 7 2 10800 1702.5 2.0 86.5 4.0 692
6 7 3 0 1684.5 2.0 8.5 1.0 0
7 7 3 3600 1694.5 2.0 15.0 2.0 176
8 7 3 10800 1704.5 2.0 34.0 4.0 400
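If each (setting_id, subject_id) group is guaranteed to contain exactly one seconds == 0 row (true of the sample data), a shorter transform-based variant of the same idea is possible; this is my sketch, not part of the original answer:

m = df['seconds'].eq(0)
# broadcast each group's lone baseline back onto its rows; 'max' skips the NaNs
baseline = (df['average'].where(m)
              .groupby([df['setting_id'], df['subject_id']])
              .transform('max'))
df['deviation'] = (df['average'].div(baseline).mul(100)
                     .round().astype(int)
                     .mask(m, 0))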
The df I have:
id measure t1 t2 t3
1 savings 1 2 5
1 income 10 15 14
1 misc 5 5 5
2 savings 3 6 12
2 income 4 20 80
2 misc 1 1 1
The df I want: add a new row to measure for each id, called spend, calculated by subtracting measure == 'savings' from measure == 'income', for each of the periods t1, t2, t3, for each id:
id measure t1 t2 t3
1 savings 1 2 5
1 income 10 15 14
1 misc 5 5 5
1 spend 9 13 9
2 savings 3 6 12
2 income 4 20 80
2 misc 1 1 1
2 spend 1 14 68
Trying:
df.loc[df['measure'] == 'spend'] = (
    df.loc[df['measure'] == 'income']
    - df.loc[df['measure'] == 'savings']
)
Failing because I am not incorporating groupby for the desired outcome.
Here is one way, using groupby + diff:
# keep only the savings and income rows (in that order within each id)
df1 = df[df.measure.isin(['savings', 'income'])].copy()
# within each id, diff puts income - savings on the income row;
# dropna drops the leading NaN row, and the result is relabelled 'spend'
s = df1.groupby('id', sort=False).diff().dropna().assign(id=df.id.unique(), measure='spend')
df = df.append(s, sort=True).sort_values('id')  # pd.concat([df, s]) in modern pandas
df
Out[276]:
id measure t1 t2 t3
0 1 savings 1.0 2.0 5.0
1 1 income 10.0 15.0 14.0
1 1 spend 9.0 13.0 9.0
2 2 savings 3.0 6.0 12.0
3 2 income 4.0 20.0 80.0
3 2 spend 1.0 14.0 68.0
Update: negate every non-income row and sum per id, i.e. spend = income minus all other measures:
df1 = df.copy()
# flip the sign of every measure except income across the t columns
df1.loc[df.measure.ne('income'), 't1':] *= -1
# summing per id now yields income minus all other measures
s = df1.groupby('id', sort=False).sum().assign(id=df.id.unique(), measure='spend')
df = df.append(s, sort=True).sort_values('id')
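For the original income - savings requirement, a pivot-style alternative is also possible; a sketch of mine, assuming each (id, measure) pair occurs once:

wide = df.set_index(['id', 'measure'])
# align income and savings on id, subtract, and relabel the rows as 'spend'
spend = (wide.xs('income', level='measure')
         - wide.xs('savings', level='measure')).assign(measure='spend').reset_index()
out = pd.concat([df, spend], ignore_index=True).sort_values('id')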
I'm having trouble understanding pandas reindex. I have a series of measurements, munged into a multi-index df, and I'd like to reindex and interpolate those measurements to align them with some other data.
My actual data has ~7 index levels and several different measurements. I hope the solution for this toy data problem is applicable to my real data. It's "small data"; each individual measurement is a couple KB.
Here's a pair of toy problems, one which shows the expected behavior and one which doesn't seem to do anything.
Single-level index, works as expected:
"""
step,value
1,1
3,2
5,1
"""
df_i = pd.read_clipboard(sep=",").set_index("step")
print(df_i)
new_index = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
df_i = df_i.reindex(new_index).interpolate()
print(df_i)
Output: the original df, followed by the reindexed and interpolated one:
value
step
1 1
3 2
5 1
value
step
1 1.0
2 1.5
3 2.0
4 1.5
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0
Works great.
Multi-index, currently not working:
"""
sample,meas_id,step,value
1,1,1,1
1,1,3,2
1,1,5,1
1,2,3,2
1,2,5,2
1,2,7,1
1,2,9,0
"""
df_mi = pd.read_clipboard(sep=",").set_index(["sample", "meas_id", "step"])
print(df_mi)
df_mi = df_mi.reindex(new_index, level="step").interpolate()
print(df_mi)
Output, unchanged after reindex (and therefore after interpolate):
value
sample meas_id step
1 1 1 1
3 2
5 1
2 3 2
5 2
7 1
9 0
value
sample meas_id step
1 1 1 1
3 2
5 1
2 3 2
5 2
7 1
9 0
How do I actually reindex a column in a multi-index df?
Here's the output I'd like, assuming linear interpolation:
value
sample meas_id step
1 1 1 1
2 1.5
3 2
5 1
6 1
7 1
8 1
9 1
2 1 NaN (or 2)
2 NaN (or 2)
3 2
4 2
5 2
6 1.5
7 1
8 0.5
9 0
I spent some serious time looking over SO, and if the answer is in there, I missed it:
Fill multi-index Pandas DataFrame with interpolation
Resampling Within a Pandas MultiIndex
pandas multiindex dataframe, ND interpolation for missing values
https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-reindexing
Possibly related GitHub issues:
https://github.com/numpy/numpy/issues/11975
https://github.com/pandas-dev/pandas/issues/23104
https://github.com/pandas-dev/pandas/issues/17132
IIUC, create the full index by using MultiIndex.from_product, then just reindex:
idx = pd.MultiIndex.from_product([df_mi.index.levels[0], df_mi.index.levels[1], new_index])
df_mi.reindex(idx).interpolate()
Out[161]:
value
1 1 1 1.000000
2 1.500000
3 2.000000
4 1.500000
5 1.000000
6 1.142857
7 1.285714
8 1.428571
9 1.571429
2 1 1.714286 # problem: interpolation bleeds in from the previous group
2 1.857143
3 2.000000
4 2.000000
5 2.000000
6 1.500000
7 1.000000
8 0.500000
9 0.000000
My fix: build the index and interpolate per group, so values never bleed across group boundaries:

def idx(x):
    return pd.MultiIndex.from_product([
        x.index.get_level_values(0).unique(),
        x.index.get_level_values(1).unique(),
        new_index,
    ])

pd.concat([y.reindex(idx(y)).interpolate() for _, y in df_mi.groupby(level=[0, 1])])
value
1 1 1 1.0
2 1.5
3 2.0
4 1.5
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0
2 1 NaN
2 NaN
3 2.0
4 2.0
5 2.0
6 1.5
7 1.0
8 0.5
9 0.0
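The same per-group idea can be phrased with groupby + apply; a sketch of mine, assuming a pandas version with DataFrame.droplevel:

out = (df_mi.groupby(level=['sample', 'meas_id'])
         .apply(lambda g: g.droplevel(['sample', 'meas_id'])
                           .reindex(new_index)
                           .interpolate()))

apply re-attaches the group keys as the outer index levels, and interpolate only fills forward by default, which is why the leading steps of the second group stay NaN.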
I have 2 DataFrames in pandas.
in_degree:
Target in_degree
0 2 1
1 4 24
2 5 53
3 6 98
4 7 34
out_degree:
Source out_degree
0 1 4
1 2 4
2 3 5
3 4 5
4 5 5
By matching the node ids in the two key columns, I'd like to create a new DataFrame that adds in_degree and out_degree for each node and displays the result.
The sample output should look like:
Source/Target out_degree
0 1 4
1 2 5
2 3 5
3 4 29
4 5 58
Any help would be appreciated.
Thanks.
Traditionally, this would need a merge, but I think you can take advantage of pandas' index-aligned arithmetic to do this a bit faster:
x = df2.set_index('Source')
y = df1.set_index('Target').rename_axis('Source')
y.columns = x.columns  # align the column name so add() lines up

# index-aligned addition; fill_value=0 treats a missing degree as 0
x.add(y.reindex(x.index), fill_value=0).reset_index()

Note that y.reindex(x.index) restricts the result to the Sources in df2, so Targets 6 and 7 are dropped, matching the sample output.
Source out_degree
0 1 4.0
1 2 5.0
2 3 5.0
3 4 29.0
4 5 58.0
The "traditional" SQL way of solving this would be using merge:
v = df1.merge(df2, left_on='Target', right_on='Source', how='right')
dct = dict(
Source=v['Source'],
out_degree=v['in_degree'].add(v['out_degree'], fill_value=0))
pd.DataFrame(dct).sort_values('Source')
Source out_degree
3 1 4.0
0 2 5.0
4 3 5.0
1 4 29.0
2 5 58.0
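If nodes that appear only as a Target (6 and 7 here) should be kept as well, dropping the reindex yields the union of both indexes instead; a small variation, in case that is the desired behaviour:

# union of both indexes; fill_value=0 covers nodes missing from either side
x.add(y, fill_value=0).reset_index()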
I have a question about a DataFrame which looks like this (example):
index ID time value
0 1 2h 10
1 1 2.15h 15
2 1 2.30h 5
3 1 2.45h 24
4 2 2.15h 6
5 2 2.30h 12
6 2 2.45h 18
7 3 2.15h 2
8 3 2.30h 1
I would like to keep only the time points that all IDs have in common, i.e. the maximum overlap across IDs.
So:
index ID time value
1 1 2.15h 15
2 1 2.30h 5
4 2 2.15h 6
5 2 2.30h 12
7 3 2.15h 2
8 3 2.30h 1
I know I could create a df of unique times, merge each ID to it separately, and then keep the times for which every ID is filled, but this is quite impractical. I have looked but have not found an answer for a possible smarter way. Does someone have an idea how to make this more practical?
Use:
cols = df.groupby(['ID', 'time']).size().unstack().dropna(axis=1).columns
df = df[df['time'].isin(cols)]
print (df)
ID time value
1 1 2.15h 15
2 1 2.30h 5
4 2 2.15h 6
5 2 2.30h 12
7 3 2.15h 2
8 3 2.30h 1
Details:
First aggregate the DataFrame with groupby and size, then reshape with unstack; NaNs are created for the non-overlapping values:
print (df.groupby(['ID', 'time']).size().unstack())
time 2.15h 2.30h 2.45h 2h
ID
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 NaN
3 1.0 1.0 NaN NaN
Remove the columns containing NaN with dropna(axis=1) and take the column names:
print (df.groupby(['ID', 'time']).size().unstack().dropna(axis=1))
time 2.15h 2.30h
ID
1 1.0 1.0
2 1.0 1.0
3 1.0 1.0
And last, filter the rows with isin and boolean indexing:
df = df[df['time'].isin(cols)]
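A one-liner variant with transform is also possible, assuming each (ID, time) pair occurs at most once; this is my sketch, not part of the answer above: keep only the times whose count of distinct IDs equals the total number of IDs.

# keep the times that are present for every ID
n_ids = df['ID'].nunique()
df = df[df.groupby('time')['ID'].transform('nunique').eq(n_ids)]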