I have a dataframe:
Name Segment Axis 1 2 3 4 5
0 Amazon 1 slope NaN 100 120 127 140
1 Amazon 1 x 0.0 1.0 2.0 3.0 4.0
2 Amazon 1 y 0.0 0.4 0.8 1.2 1.6
3 Amazon 2 slope NaN 50 57 58 59
4 Amazon 2 x 0.0 2.0 4.0 6.0 8.0
5 Amazon 2 y 0.0 1.0 2.0 3.0 4.0
df2:
Name Segment Optimal Cost
Amazon 1 115
Amazon 2 60
Netflix 1 100
Netflix 2 110
I am trying to compare the slope values in the Axis column to the corresponding optimal cost values and extract the slope, x and y values.
The rules are: find the first slope value greater than its corresponding optimal cost.
If no value is greater than the optimal cost, then report the row where the slope is zero.
If all values are greater than the optimal cost, then report the highest y value.
Expected output:
Name Segment slope x y
0 Amazon 1 120 2 0.8
1 Amazon 2 NaN 0 0
s = df.set_index(['Name', 'Segment', 'Axis']).stack().unstack('Axis').reset_index(level=2, drop=True)  # reshape df so slope/x/y become columns
df3 = pd.merge(s, df2, on=['Name', 'Segment'], how='left')  # merge the reshaped frame with df2
df3[df3['slope'] > df3['Optimal_Cost']].groupby(['Name', 'Segment']).first().reset_index()  # keep the first slope above the optimal cost per group
Name Segment slope x y Optimal_Cost
0 Amazon 1 120.0 2.0 0.8 115
1 Amazon 2 72.0 6.0 3.0 60
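This only covers the "first slope above optimal cost" case. If you also need the two fallback rules from the question, something like this sketch could work (untested, assuming df3 from the merge above keeps the rows of each group in ascending x order):

def pick(g):
    above = g[g['slope'] > g['Optimal_Cost']]
    real = g.dropna(subset=['slope'])
    if above.empty:                  # rule 2: nothing above -> the x = 0, y = 0 row
        return g.head(1)
    if len(above) == len(real):      # rule 3: everything above -> highest y
        return g.nlargest(1, 'y')
    return above.head(1)             # rule 1: first slope above the optimal cost

out = (df3.groupby(['Name', 'Segment'], group_keys=False)
          .apply(pick)
          .reset_index(drop=True))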
I am fairly new to Python and I have the following dataframe:
setting_id subject_id seconds result_id owner_id average duration_id
0 7 1 0 1680.5 2.0 24.000 1.0
1 7 1 3600 1690.5 2.0 46.000 2.0
2 7 1 10800 1700.5 2.0 101.000 4.0
3 7 2 0 1682.5 2.0 12.500 1.0
4 7 2 3600 1692.5 2.0 33.500 2.0
5 7 2 10800 1702.5 2.0 86.500 4.0
6 7 3 0 1684.5 2.0 8.500 1.0
7 7 3 3600 1694.5 2.0 15.000 2.0
8 7 3 10800 1704.5 2.0 34.000 4.0
What I need to do is calculate the deviation (%) of the averages with a "seconds" value not equal to 0 from the average with a seconds value of zero, where the subject_id and setting_id are the same.
i.e. setting_id ==7 & subject_id ==1 would be:
(result/baseline)*100
------> for 3600 seconds: (46/24)*100 = +192%
------> for 10800 seconds: (101/24)*100 = +421%
.... baseline = average-result with a seconds value of 0
.... result = average-result with a seconds value other than 0
The resulting df should look like this:
setting_id subject_id seconds owner_id average deviation duration_id
0 7 1 0 2 24 0 1
1 7 1 3600 2 46 192 2
2 7 1 10800 2 101 421 4
I want to use these calculations then to plot a regression graph (with seaborn) of deviations from baseline
I have played around with this df for 2 days now and tried different for loops, but I just can't figure out the correct way.
You can use:
# identify rows with seconds == 0
m = df['seconds'].eq(0)

# compute the baseline per (setting_id, subject_id) group
# (the sum of the rows where seconds == 0)
s = (df['average'].where(m)
       .groupby([df['setting_id'], df['subject_id']])
       .sum()
     )

# compute the deviation per group
deviation = (
    df[['setting_id', 'subject_id']]
    .merge(s, left_on=['setting_id', 'subject_id'], right_index=True, how='left')
    ['average']
    .rdiv(df['average']).mul(100)
    .round().astype(int)  # optional
    .mask(m, 0)
)

df['deviation'] = deviation
# or
# out = df.assign(deviation=deviation)
Output:
setting_id subject_id seconds result_id owner_id average duration_id deviation
0 7 1 0 1680.5 2.0 24.0 1.0 0
1 7 1 3600 1690.5 2.0 46.0 2.0 192
2 7 1 10800 1700.5 2.0 101.0 4.0 421
3 7 2 0 1682.5 2.0 12.5 1.0 0
4 7 2 3600 1692.5 2.0 33.5 2.0 268
5 7 2 10800 1702.5 2.0 86.5 4.0 692
6 7 3 0 1684.5 2.0 8.5 1.0 0
7 7 3 3600 1694.5 2.0 15.0 2.0 176
8 7 3 10800 1704.5 2.0 34.0 4.0 400
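A shorter equivalent (a sketch) is to broadcast the baseline with transform instead of the merge:

m = df['seconds'].eq(0)

# broadcast the seconds == 0 baseline to every row of its (setting_id, subject_id) group
baseline = (df['average'].where(m)
              .groupby([df['setting_id'], df['subject_id']])
              .transform('sum'))

df['deviation'] = df['average'].div(baseline).mul(100).round().astype(int).mask(m, 0)

For the regression plot mentioned in the question, something like sns.regplot(data=df, x='seconds', y='deviation') (with import seaborn as sns) should be a reasonable starting point.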
I have a dataframe:
Name Segment Axis 1 2 3 4 5
0 Amazon 1 slope NaN 100 120 127 140
1 Amazon 1 x 0.0 1.0 2.0 3.0 4.0
2 Amazon 1 y 0.0 0.4 0.8 1.2 1.6
3 Amazon 2 slope NaN 50 57 72 81
4 Amazon 2 x 0.0 2.0 4.0 6.0 8.0
5 Amazon 2 y 0.0 1.0 2.0 3.0 4.0
df2:
Name Segment Optimal Cost
Amazon 1 115
Amazon 2 60
Netflix 1 100
Netflix 2 110
I am trying to compare the slope values in the Axis column to the corresponding optimal cost values and extract the slope, x and y values that are closest to the optimal cost.
Expected output:
Name Segment slope x y
0 Amazon 1 120 2 0.8
1 Amazon 2 57 4 2
You can use pd.merge_asof to perform this type of merge quickly. However, there is some preprocessing you'll need to do to your data:
Reshape df1 to match the format of the expected output (i.e. where "slope", "x", and "y" are columns instead of rows).
Drop NaNs from the merge keys AND sort both df1 and df2 by their merge keys (this is a requirement of pd.merge_asof that we need to do explicitly). The merge keys are going to be the "slope" and "Optimal Cost" columns.
Ensure that the merge keys are of the same dtype (in this case they should both be floats, meaning we'll need to convert "Optimal Cost" to a float type instead of int).
Perform the merge operation.
# Reshape df1
df1_reshaped = df1.set_index(["Name", "Segment", "Axis"]).unstack(-1).stack(0)
# Drop NaN, sort_values by the merge keys, ensure merge keys are same dtype
df1_reshaped = df1_reshaped.dropna(subset=["slope"]).sort_values("slope")
df2 = df2.sort_values("Optimal Cost").astype({"Optimal Cost": float})
# Perform the merge
out = (
pd.merge_asof(
df2,
df1_reshaped,
left_on="Optimal Cost",
right_on="slope",
by=["Name", "Segment"],
direction="nearest"
).dropna()
)
print(out)
Name Segment Optimal Cost slope x y
0 Amazon 2 60.0 57.0 4.0 2.0
3 Amazon 1 115.0 120.0 2.0 0.8
And that's it!
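One optional refinement, in case a nearest match should only count when it is reasonably close: pd.merge_asof also accepts a tolerance argument. A sketch (the cutoff of 10.0 is an arbitrary, hypothetical choice):

out = (
    pd.merge_asof(
        df2,
        df1_reshaped,
        left_on="Optimal Cost",
        right_on="slope",
        by=["Name", "Segment"],
        direction="nearest",
        tolerance=10.0,  # hypothetical: discard matches more than 10 away
    ).dropna()
)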
If you're curious, here is what df1_reshaped and df2 look like prior to the merge (after the preprocessing):
>>> print(df1_reshaped)
Axis slope x y
Name Segment
Amazon 2 2 50.0 2.0 1.0
3 57.0 4.0 2.0
4 72.0 6.0 3.0
5 81.0 8.0 4.0
1 2 100.0 1.0 0.4
3 120.0 2.0 0.8
4 127.0 3.0 1.2
5 140.0 4.0 1.6
>>> print(df2)
Name Segment Optimal Cost
1 Amazon 2 60.0
2 Netflix 1 100.0
3 Netflix 2 110.0
0 Amazon 1 115.0
# Extract the slope rows and align df2 on the same (Name, Segment) index
slope = df1.loc[df1["Axis"] == "slope"].set_index(["Name", "Segment"]).drop(columns="Axis")
optim = df2.set_index(["Name", "Segment"]).reindex(slope.index)
# Find the closest column to the optimal cost
idx = slope.sub(optim.values).abs().idxmin(axis="columns")
>>> idx
Name Segment
Amazon 1 3 # column '3' 120 <- optimal: 115
2 3 # column '3' 57 <- optimal: 60
dtype: object
>>> df1.set_index(["Name", "Segment", "Axis"]) \
.groupby(["Name", "Segment"], as_index=False) \
.apply(lambda x: x[idx[x.name]]).unstack() \
.rename_axis(columns=None).reset_index(["Name", "Segment"])
Name Segment slope x y
0 Amazon 1 120.0 2.0 0.8
1 Amazon 2 57.0 4.0 2.0
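If the chained groupby/apply above feels opaque, a more explicit (if slower) way to turn idx into the final frame is a plain loop over the chosen columns, sketched here:

rows = []
for (name, seg), col in idx.items():
    block = df1[(df1["Name"] == name) & (df1["Segment"] == seg)]
    vals = block.set_index("Axis")[col]   # the slope/x/y values in the chosen column
    rows.append({"Name": name, "Segment": seg, **vals.to_dict()})
out = pd.DataFrame(rows)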
I have a DataFrame (df) (for instance, this simplified version):
A B
0 2.0 3.0
1 3.0 4.0
and I generated 20 bootstrap resamples, which are all now in the same df but differ in the Resample Nr.:
A B
0 1 0 2.0 3.0
1 1 1 3.0 4.0
2 2 1 3.0 4.0
3 2 1 3.0 4.0
.. ..
.. ..
39 20 0 2.0 3.0
40 20 0 2.0 3.0
Now I want to apply a certain function to each Resample Nr. Say:
C = sum(df['A'] * df['B']) / sum(df['B'] ** 2)
The output would look like this:
A B C
0 1 0 2.0 3.0 Calculated Value X1
1 1 1 3.0 4.0 Calculated Value X1
2 2 1 3.0 4.0 Calculated Value X2
3 2 1 3.0 4.0 Calculated Value X2
.. ..
.. ..
39 20 0 2.0 3.0 Calculated Value X20
40 20 0 2.0 3.0 Calculated Value X20
So there are 20 different new values.
I know there is a df.iloc command where I can specify my row selection df.iloc[row, column] but I would like to find a command where I don't have to repeat the code for the 20 samples.
My goal is to find a command that identifies the Resample Nr. automatically and then calculates the function for each Resample Nr.
How can I do this?
Thank you!
Use DataFrame.assign to create two new columns x and y that correspond to df['A'] * df['B'] and df['B']**2, then use DataFrame.groupby on the Resample Nr. (or level=1) and transform using sum:
s = df.assign(x=df['A'].mul(df['B']), y=df['B']**2)\
.groupby(level=1)[['x', 'y']].transform('sum')
df['C'] = s['x'].div(s['y'])
Result:
A B C
0 1 0 2.0 3.0 0.720000
1 1 1 3.0 4.0 0.720000
2 2 1 3.0 4.0 0.750000
3 2 1 3.0 4.0 0.750000
39 20 0 2.0 3.0 0.666667
40 20 0 2.0 3.0 0.666667
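An equivalent (a sketch, assuming the Resample Nr. really is index level 1 as above) is to compute C once per group with apply and map it back:

C = df.groupby(level=1).apply(lambda g: (g['A'] * g['B']).sum() / (g['B'] ** 2).sum())
df['C'] = df.index.get_level_values(1).map(C)

The transform version above avoids the per-group Python lambda, so it should scale better on large frames.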
I have a df like the one below:
userId movieId rating
0 1 31 2.0
1 2 10 4.0
2 2 17 5.0
3 2 39 5.0
4 2 47 4.0
5 3 31 3.0
6 3 10 2.0
I need to add two columns: one is mean, the mean rating for each movie; the other is diff, the difference between rating and mean.
Please note that movieId can be repeated because different users may rate the same movie. Here rows 0 and 5 are for movieId 31, and rows 1 and 6 are for movieId 10:
userId movieId rating mean diff
0 1 31 2.0 2.5 -0.5
1 2 10 4.0 3 1
2 2 17 5.0 5 0
3 2 39 5.0 5 0
4 2 47 4.0 4 0
5 3 31 3.0 2.5 0.5
6 3 10 2.0 3 -1
Here is some of my code, which calculates the mean:
df = df.groupby('movieId')['rating'].agg(['count','mean']).reset_index()
You can use transform to keep the same number of rows when calculating mean with groupby. Calculating the difference is straightforward from that:
df['mean'] = df.groupby('movieId')['rating'].transform('mean')
df['diff'] = df['rating'] - df['mean']
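If you prefer method chaining, both steps fit into a single assign call, since later lambdas can see the columns created by earlier ones:

out = df.assign(
    mean=lambda d: d.groupby('movieId')['rating'].transform('mean'),
    diff=lambda d: d['rating'] - d['mean'],
)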
Sorry for the seemingly confusing title. The problem should be really simple, but I'm stumped and need some help here.
The data frame that I have now:
New_ID STATE MEAN
0 1 Lagos 7166.101571
1 2 Rivers 2464.065846
2 3 Oyo 1974.699365
3 4 Akwa 1839.126698
4 5 Kano 1757.642462
I want to create a new column such that, in row i, it holds df.loc[:i, 'MEAN'].sum() / df['MEAN'].sum().
For example, for data frame:
ID MEAN
0 1.0 5
1 2.0 10
2 3.0 15
3 4.0 30
4 5.0 40
My desired output:
ID MEAN SUBTOTAL
0 1.0 5 0.05
1 2.0 10 0.15
2 3.0 15 0.30
3 4.0 30 0.60
4 5.0 40 1.00
I tried
df1['SUbTotal'] = df1.loc[:df1['New_ID'], 'MEAN']/df1['MEAN'].sum()
but it says:
Name: New_ID, dtype: int32' is an invalid key
Thanks for your time in advance
This should do it; it seems like you're looking for cumsum:
df['SUBTOTAL'] = df.MEAN.cumsum() / df.MEAN.sum()
>>> df
ID MEAN SUBTOTAL
0 1.0 5 0.05
1 2.0 10 0.15
2 3.0 15 0.30
3 4.0 30 0.60
4 5.0 40 1.00
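For reference, cumsum is equivalent here to an expanding window sum, so this sketch produces the same column:

df['SUBTOTAL'] = df['MEAN'].expanding().sum() / df['MEAN'].sum()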