I have a dataframe:
     Name  Segment   Axis    1      2      3      4      5
0  Amazon        1  slope  NaN  100.0  120.0  127.0  140.0
1  Amazon        1      x  0.0    1.0    2.0    3.0    4.0
2  Amazon        1      y  0.0    0.4    0.8    1.2    1.6
3  Amazon        2  slope  NaN   50.0   57.0   72.0   81.0
4  Amazon        2      x  0.0    2.0    4.0    6.0    8.0
5  Amazon        2      y  0.0    1.0    2.0    3.0    4.0
df2:
      Name  Segment  Optimal Cost
0   Amazon        1           115
1   Amazon        2            60
2  Netflix        1           100
3  Netflix        2           110
I am trying to compare the slope values in the Axis rows to the corresponding optimal cost values, and extract the slope, x and y values that are closest to the optimal cost.
Expected output:
     Name  Segment  slope  x    y
0  Amazon        1    120  2  0.8
1  Amazon        2     57  4    2
You can use pd.merge_asof to perform this type of merge quickly. However, there is some preprocessing you'll need to do to your data:
1. Reshape df1 to match the format of the expected output (i.e. where "slope", "x", and "y" are columns instead of rows).
2. Drop NaNs from the merge keys AND sort both df1 and df2 by their merge keys (this is a requirement of pd.merge_asof that we need to satisfy explicitly). The merge keys are going to be the "slope" and "Optimal Cost" columns.
3. Ensure that the merge keys are of the same dtype (in this case they should both be floats, meaning we'll need to convert "Optimal Cost" to a float type instead of int).
4. Perform the merge operation (full code below).
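To make the snippet below runnable on its own, here is a minimal construction of df1 and df2 with the values from the question (a sketch; adapt it to however your real data is loaded):
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    "Name": ["Amazon"] * 6,
    "Segment": [1, 1, 1, 2, 2, 2],
    "Axis": ["slope", "x", "y"] * 2,
    1: [np.nan, 0.0, 0.0, np.nan, 0.0, 0.0],
    2: [100.0, 1.0, 0.4, 50.0, 2.0, 1.0],
    3: [120.0, 2.0, 0.8, 57.0, 4.0, 2.0],
    4: [127.0, 3.0, 1.2, 72.0, 6.0, 3.0],
    5: [140.0, 4.0, 1.6, 81.0, 8.0, 4.0],
})
df2 = pd.DataFrame({
    "Name": ["Amazon", "Amazon", "Netflix", "Netflix"],
    "Segment": [1, 2, 1, 2],
    "Optimal Cost": [115, 60, 100, 110],
})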
# Reshape df1
df1_reshaped = df1.set_index(["Name", "Segment", "Axis"]).unstack(-1).stack(0)

# Drop NaN, sort_values by the merge keys, ensure merge keys are same dtype
df1_reshaped = df1_reshaped.dropna(subset=["slope"]).sort_values("slope")
df2 = df2.sort_values("Optimal Cost").astype({"Optimal Cost": float})

# Perform the merge
# (`by` requires "Name" and "Segment" to be columns, so move them out of
#  the index and drop the leftover level that held the original column labels)
out = (
    pd.merge_asof(
        df2,
        df1_reshaped.reset_index(-1, drop=True).reset_index(),
        left_on="Optimal Cost",
        right_on="slope",
        by=["Name", "Segment"],
        direction="nearest",
    )
    .dropna()
)
print(out)
Name Segment Optimal Cost slope x y
0 Amazon 2 60.0 57.0 4.0 2.0
3 Amazon 1 115.0 120.0 2.0 0.8
And that's it!
If you're curious, here is what df1_reshaped and df2 look like prior to the merge (after the preprocessing).
>>> print(df1_reshaped)
Axis              slope    x    y
Name   Segment
Amazon 2       2   50.0  2.0  1.0
               3   57.0  4.0  2.0
               4   72.0  6.0  3.0
               5   81.0  8.0  4.0
       1       2  100.0  1.0  0.4
               3  120.0  2.0  0.8
               4  127.0  3.0  1.2
               5  140.0  4.0  1.6
>>> print(df2)
Name Segment Optimal Cost
1 Amazon 2 60.0
2 Netflix 1 100.0
3 Netflix 2 110.0
0 Amazon 1 115.0
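A usage note on direction="nearest": it matches the closest slope in either direction. merge_asof also accepts direction="backward" (closest slope at or below the optimal cost) and direction="forward" (closest slope at or above it), if one of those policies fits your data better.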
# Extract the slope rows and rearrange the index
slope = df1.loc[df1["Axis"] == "slope"].set_index(["Name", "Segment"]).drop(columns="Axis")
# Now slope and optim have the same index
optim = df2.set_index(["Name", "Segment"]).reindex(slope.index)

# Find the column whose slope is closest to the optimal cost
idx = slope.sub(optim.values).abs().idxmin(axis="columns")
>>> idx
Name Segment
Amazon 1 3 # column '3' 120 <- optimal: 115
2 3 # column '3' 57 <- optimal: 60
dtype: object
>>> df1.set_index(["Name", "Segment", "Axis"]) \
.groupby(["Name", "Segment"], as_index=False) \
.apply(lambda x: x[idx[x.name]]).unstack() \
.rename_axis(columns=None).reset_index(["Name", "Segment"])
Name Segment slope x y
0 Amazon 1 120.0 2.0 0.8
1 Amazon 2 57.0 4.0 2.0
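If the groupby/apply chain feels opaque, here is a hedged alternative sketch for that last extraction step; it assumes the idx Series computed above and the df1 from the question:
# For each (Name, Segment), grab the column chosen by idx and
# return its slope/x/y values as one row
wide = df1.set_index(["Name", "Segment", "Axis"])
rows = {key: wide.loc[key, col] for key, col in idx.items()}
result = pd.DataFrame(rows).T.rename_axis(["Name", "Segment"]).reset_index()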
If this was my dataframe:
      a  b     c
0  12.0  5   0.1
1   9.0  7   8.0
2   1.1  2  12.9
I can use the following code to get the max values in each row... (12) (9) (12.9):
df = df.max(axis=1)
But I don't know how you would get the max values comparing only columns a & b (12, 9, 2).
Assuming one wants to consider only the columns a and b, and store the maximum value in a new column called max, one can do the following
df['max'] = df[['a', 'b']].max(axis=1)
[Out]:
a b c max
0 12.0 5 0.1 12.0
1 9.0 7 8.0 9.0
2 1.1 2 12.9 2.0
One can also do that with a custom lambda function, as follows
df['max'] = df[['a', 'b']].apply(lambda x: max(x), axis=1)
[Out]:
a b c max
0 12.0 5 0.1 12.0
1 9.0 7 8.0 9.0
2 1.1 2 12.9 2.0
As per OP's request, if one wants to create a new column, max_of_all, to store the row-wise maximum across all of the dataframe's columns, one can use the following
df['max_of_all'] = df.max(axis=1)
[Out]:
a b c max max_of_all
0 12.0 5 0.1 12.0 12.0
1 9.0 7 8.0 9.0 9.0
2 1.1 2 12.9 2.0 12.9
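For completeness, a minimal self-contained sketch tying the pieces together (assuming the example values above):
import pandas as pd

df = pd.DataFrame({"a": [12.0, 9.0, 1.1], "b": [5, 7, 2], "c": [0.1, 8.0, 12.9]})

df["max"] = df[["a", "b"]].max(axis=1)  # row-wise max over columns a and b only
df["max_of_all"] = df.max(axis=1)       # row-wise max over every column
print(df)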
Let's say I have multiple data frames that have a format of:
   Id no  A  B
0      1  2  1
1      2  3  5
2      2  5  6
3      1  6  7
which I want to group by "Id no" and apply an aggregation to, storing the new values in a different dataframe, such as:
df_calc = pd.DataFrame(columns=["Mean", "Median", "Std"])

for df in dataframes:
    mean = df.groupby(["Id no"]).mean()
    median = df.groupby(["Id no"]).median()
    std = df.groupby(["Id no"]).std()
    df_f = pd.DataFrame({"Mean": [mean], "Median": [median], "Std": [std]})
    df_calc = pd.concat([df_calc, df_f])
This is the format in which my final dataframe df_calc comes out, but I would like for it to look like this. How do I go about doing so?
You can agg multiple functions, then swap the column levels and reorder the columns:
out = df.groupby('Id no').agg({'A': ['median','std','mean'],
'B': ['median','std','mean']})
print(out)
A B
median std mean median std mean
Id no
1 4.0 2.828427 4.0 4.0 4.242641 4.0
2 4.0 1.414214 4.0 5.5 0.707107 5.5
out = out.swaplevel(0, 1, 1).sort_index(axis=1)
print(out)
mean median std
A B A B A B
Id no
1 4.0 4.0 4.0 4.0 2.828427 4.242641
2 4.0 5.5 4.0 5.5 1.414214 0.707107
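A minimal end-to-end sketch of the above, assuming the example frame from the question:
import pandas as pd

df = pd.DataFrame({"Id no": [1, 2, 2, 1],
                   "A": [2, 3, 5, 6],
                   "B": [1, 5, 6, 7]})

out = df.groupby("Id no").agg({"A": ["median", "std", "mean"],
                               "B": ["median", "std", "mean"]})
# Move the statistic names to the top column level and group columns by statistic
out = out.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(out)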
I have a dataframe:
Name Segment Axis 1 2 3 4 5
0 Amazon 1 slope NaN 100 120 127 140
1 Amazon 1 x 0.0 1.0 2.0 3.0 4.0
2 Amazon 1 y 0.0 0.4 0.8 1.2 1.6
3 Amazon 2 slope NaN 50 57 58 59
4 Amazon 2 x 0.0 2.0 4.0 6.0 8.0
5 Amazon 2 y 0.0 1.0 2.0 3.0 4.0
df2:
Name Segment Optimal Cost
Amazon 1 115
Amazon 2 60
Netflix 1 100
Netflix 2 110
I am trying to compare the slope values in the Axis rows to the corresponding optimal cost values and extract the slope, x and y values.
The rule is: find the first slope value greater than its corresponding optimal cost.
If there is no value greater than the optimal cost, then report where the slope is zero.
If there are only values greater than the optimal cost, then report the highest y value.
Expected output:
Name Segment slope x y
0 Amazon 1 120 2 0.8
1 Amazon 2 NaN 0 0
# Melt dataframe 1
s = df.set_index(['Name', 'Segment', 'Axis']).stack().unstack('Axis').reset_index(level=2, drop=True)
# Merge the melted dataframe with df2
df3 = pd.merge(s, df2, on=['Name', 'Segment'], how='left')
# Filter as required
df3[df3['slope'] > df3['Optimal_Cost']].groupby(['Name', 'Segment']).first().reset_index()
Name Segment slope x y Optimal_Cost
0 Amazon 1 120.0 2.0 0.8 115
1 Amazon 2 72.0 6.0 3.0 60
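The filter above only covers the first rule. A hedged sketch of all three rules, building on the merged df3 (and treating a missing slope as zero, which is what the expected output suggests):
def pick(g):
    above = g[g['slope'] > g['Optimal_Cost']]
    if len(above):                                   # rule 1: first slope above the cost
        return above.iloc[0]
    zero = g[g['slope'].isna() | (g['slope'] == 0)]  # rule 2: fall back to zero/missing slope
    if len(zero):
        return zero.iloc[0]
    return g.loc[g['y'].idxmax()]                    # rule 3: otherwise the highest y

out = df3.groupby(['Name', 'Segment']).apply(pick)[['slope', 'x', 'y']].reset_index()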
I'm working with a pandas data frame. I have unwanted data in some cells. I need to clear that data from specific cells and shift the whole row towards the left by one cell. I have tried a couple of things, but they are not working for me. Here is an example dataframe:
userId movieId ratings extra
0 1 500 3.5
1 1 600 4.5
2 1 www.abcd 700 2.0
3 2 1100 5.0
4 2 1200 4.0
5 3 600 4.5
6 4 600 5.0
7 4 1900 3.5
Expected Outcome:
userId movieId ratings extra
0 1 500 3.5
1 1 600 4.5
2 1 700 2.0
3 2 1100 5.0
4 2 1200 4.0
5 3 600 4.5
6 4 600 5.0
7 4 1900 3.5
I have tried the following code but it shows the following error:
raw = df[df['ratings'].str.contains('www') == True]

# Here I am trying to set the specific cell value to empty, but it shows the error below
df = df.at[raw, 'movieId'] = ' '
**AttributeError:** 'str' object has no attribute 'at'

# Code for shifting the cell values to the left
df.iloc[raw, 2:-1] = df.iloc[raw, 2:-1].shift(-1, axis=1)
You can shift values by a mask, but it is really important to match the types: if the movieId column is filled with strings (because of at least one string value), it is necessary to convert it to numeric with to_numeric, otherwise data is lost because of the mismatched types:
m = df['movieId'].str.contains('www')
df['movieId'] = pd.to_numeric(df['movieId'], errors='coerce')
#if want shift only missing values rows
#m = df['movieId'].isna()
df[m] = df[m].shift(-1, axis=1)
df['userId'] = df['userId'].ffill()
df = df.drop('extra', axis=1)
print (df)
userId movieId ratings
0 1.0 500.0 3.5
1 1.0 600.0 4.5
2 1.0 700.0 2.0
3 2.0 1100.0 5.0
4 2.0 1200.0 4.0
5 3.0 600.0 4.5
6 4.0 600.0 5.0
7 4.0 1900.0 3.5
If you omit the conversion to numeric, you get a missing value:
m = df['movieId'].str.contains('www')
df[m] = df[m].shift(-1, axis=1)
df['userId'] = df['userId'].ffill()
df = df.drop('extra', axis=1)
print (df)
userId movieId ratings
0 1.0 500 3.5
1 1.0 600 4.5
2 1.0 NaN 2.0
3 2.0 1100 5.0
4 2.0 1200 4.0
5 3.0 600 4.5
6 4.0 600 5.0
7 4.0 1900 3.5
You can try this:
df['movieId'] = pd.to_numeric(df['movieId'], errors='coerce')
df = df.sort_values(by='movieId', ascending=True)
Consider the following dataframe:
df = pd.DataFrame({
'a': np.arange(1, 5),
'b': np.arange(1, 5) * 2,
'c': np.arange(1, 5) * 3
})
a b c
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
I want to calculate the cumulative sum for each row across the columns:
def expanding_func(s):
return s.sum()
df.expanding(1, axis=1).apply(expanding_func, raw=True)
# As expected:
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0
However, if I set raw=False, expanding_func no longer works:
df.expanding(1, axis=1).apply(expanding_func, raw=False)
ValueError: Length of passed values is 3, index implies 4
The documentation says expanding_func
Must produce a single value from an ndarray input if raw=True or a single value from a Series if raw=False.
And that is exactly what I was doing. Why did expanding_func fail when raw=False?
Note: this is only a contrived example. I want to know how to write a custom rolling function, not how to calculate the cumulative sum across columns.
It seems this is a bug with pandas.
If you do:
df.iloc[:3].expanding(1, axis=1).apply(expanding_func, raw=False)
It actually works. It seems that when the values are passed as a Series, pandas checks the number of returned values against the number of rows of the dataframe for some reason (it should compare against the number of columns of the df).
A workaround is to transpose the df, apply your function, and transpose back, which seems to work. The bug only seems to appear when axis is set to 1.
df.T.expanding(1, axis=0).apply(expanding_func, raw=False).T
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0
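As a self-contained version of that workaround (and note that newer pandas versions have deprecated the axis=1 argument on expanding/rolling, so transposing may be the way forward anyway), a minimal sketch assuming the frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1, 5),
                   'b': np.arange(1, 5) * 2,
                   'c': np.arange(1, 5) * 3})

def expanding_func(s):
    # any custom reduction over the expanding window; sum is just the example
    return s.sum()

# Expand down the transposed frame, then transpose back
out = df.T.expanding(1).apply(expanding_func, raw=False).T
print(out)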
You don't need to set raw to False or True, just do it the simple way:
df.expanding(0, axis=1).apply(expanding_func)
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0