csv format of dataframe from pd.concat groupby() dataframe in python

Let's say I have multiple data frames with the following format:
Id no  A  B
    1  2  1
    2  3  5
    2  5  6
    1  6  7
I want to group each data frame by "Id" and apply an aggregation, then store the new values in a different dataframe, such as:
df_calc = pd.DataFrame(columns=["Mean", "Median", "Std"])
for df in dataframes:
    mean = df.groupby(["Id"]).mean()
    median = df.groupby(["Id"]).median()
    std = df.groupby(["Id"]).std()
    df_f = pd.DataFrame(
        {"Mean": [mean], "Median": [median], "Std": [std]})
    df_calc = pd.concat([df_calc, df_f])
This is the format my final dataframe df_calc comes out in,
but I would like it to look like this.
How do I go about doing so?

You can try aggregating with multiple functions, then swap the column levels and reorder the columns:
out = df.groupby('Id no').agg({'A': ['median','std','mean'],
                               'B': ['median','std','mean']})
print(out)
A B
median std mean median std mean
Id no
1 4.0 2.828427 4.0 4.0 4.242641 4.0
2 4.0 1.414214 4.0 5.5 0.707107 5.5
out = out.swaplevel(0, 1, 1).sort_index(axis=1)
print(out)
mean median std
A B A B A B
Id no
1 4.0 4.0 4.0 4.0 2.828427 4.242641
2 4.0 5.5 4.0 5.5 1.414214 0.707107
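If you have several dataframes, as in your loop, you can apply the same aggregation to each one and concatenate the results. A minimal sketch, assuming dataframes is your list of frames (the keys argument, which labels each source frame in the result, is my own addition):

import pandas as pd

def summarize(df):
    # One groupby call computing all three statistics for both columns
    out = df.groupby('Id no').agg({'A': ['median', 'std', 'mean'],
                                   'B': ['median', 'std', 'mean']})
    return out.swaplevel(0, 1, 1).sort_index(axis=1)

# Stack the per-dataframe summaries; keys= adds an outer index level
# identifying which input frame each block came from.
df_calc = pd.concat([summarize(df) for df in dataframes],
                    keys=range(len(dataframes)))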

Related

pandas - ranking with tolerance?

Is there a way to rank values in a dataframe but considering a tolerance?
Say I have the following values
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
and if I ran rank:
ex.rank(method='average')
0 2.0
1 3.0
2 1.0
3 6.0
4 5.0
5 4.0
dtype: float64
But what I'd like as a result would be (with a tolerance of 0.01):
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
Any way to define this tolerance?
Thanks
This function may work:
def rank_with_tolerance(sr, tolerance=0.01+1e-10, method='average'):
    # Sort the unique values and index them by themselves so they can
    # serve as a lookup table.
    vals = pd.Series(sr.unique()).sort_values()
    vals.index = vals
    # Collapse any value that is within the tolerance of its predecessor
    # onto that predecessor.
    vals = vals.mask(vals - vals.shift(1) <= tolerance, vals.shift(1))
    # Map the original values through the lookup and rank the result.
    return sr.map(vals).fillna(sr).rank(method=method)
It works for your given input:
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
rank_with_tolerance(ex, tolerance=0.01+1e-10, method='average')
# result:
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
dtype: float64
And with more complex sets it seems to work too:
ex = pd.Series([16.52,19.95,19.96, 19.95, 19.97, 19.97, 19.98])
rank_with_tolerance(ex, tolerance=0.01+1e-10, method='average')
# result:
0 1.0
1 3.0
2 3.0
3 3.0
4 5.5
5 5.5
6 7.0
dtype: float64
You could do some sort of min-max scaling, i.e.
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
# Scale the values to be between 0 and 1
ex_scaled = (ex - min(ex)) / (max(ex) - min(ex))
# Stretch them onto a rank-like scale (as written this runs from 1 to len(ex) + 1)
result = ex_scaled * len(ex) + 1
# result
0 1.335347
1 4.444109
2 1.000000
3 7.000000
4 4.969789
5 4.453172
That way you still get a ranking of sorts, but values that are closer to each other end up with scores that are closer together.
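If you want the scaled scores to span exactly 1 to len(ex), the way ordinary ranks do, you can scale by len(ex) - 1 instead (a small sketch building on the snippet above):

# min(ex) maps to 1 and max(ex) maps to len(ex), mirroring rank's range
result = ex_scaled * (len(ex) - 1) + 1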
You can sort the values, merge the close ones and rank on that:
s = ex.drop_duplicates().sort_values()
mapper = (s.groupby(s.diff().abs().gt(0.011).cumsum(), sort=False)
           .transform('mean')
           .reindex_like(ex)
          )
out = mapper.rank(method='average')
N.B. I used 0.011 as the threshold because floating point arithmetic does not always provide enough precision to reliably detect a value close to the threshold.
output:
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
dtype: float64
intermediate mapper:
0 16.520
1 19.955
2 16.150
3 22.770
4 20.530
5 19.955
dtype: float64
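For reuse, the same idea can be wrapped in a small helper. A sketch (the name rank_merging_close is mine; like the snippet above, it assumes the input has no exact duplicate values, since drop_duplicates plus reindex_like would otherwise leave NaNs):

import pandas as pd

def rank_merging_close(sr, threshold=0.011, method='average'):
    # Sort the distinct values, group runs whose successive gaps stay
    # within the threshold, and replace each run by its mean.
    s = sr.drop_duplicates().sort_values()
    mapper = (s.groupby(s.diff().abs().gt(threshold).cumsum(), sort=False)
               .transform('mean')
               .reindex_like(sr))
    return mapper.rank(method=method)

ex = pd.Series([16.52, 19.95, 16.15, 22.77, 20.53, 19.96])
print(rank_merging_close(ex))  # 2.0, 3.5, 1.0, 6.0, 5.0, 3.5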

Match values in different data frame and find closest value(s)

I have a dataframe, df1 (only part of it is shown here):
4 Amazon 2 x 0.0 2.0 4.0 6.0 8.0
5 Amazon 2 y 0.0 1.0 2.0 3.0 4.0
df2:
      Name  Segment  Optimal Cost
0   Amazon        1           115
1   Amazon        2            60
2  Netflix        1           100
3  Netflix        2           110
I am trying to compare the slope values in df1 (the rows where Axis is "slope") to the corresponding Optimal Cost values in df2, and extract the slope, x and y values that are closest to the optimal cost.
Expected output:
     Name  Segment  slope  x    y
0  Amazon        1    120  2  0.8
1  Amazon        2     57  4    2
You can use pd.merge_asof to perform this type of merge quickly. However, there is some preprocessing you'll need to do to your data:
1. Reshape df1 to match the format of the expected output (i.e. so that "slope", "x", and "y" are columns instead of rows).
2. Drop NaNs from the merge keys AND sort both df1 and df2 by their merge keys (this is a requirement of pd.merge_asof that we need to satisfy explicitly). The merge keys are going to be the "slope" and "Optimal Cost" columns.
3. Ensure that the merge keys are of the same dtype (in this case they should both be floats, meaning we'll need to convert "Optimal Cost" to a float type instead of int).
4. Perform the merge operation.
# Reshape df1
df1_reshaped = df1.set_index(["Name", "Segment", "Axis"]).unstack(-1).stack(0)
# Drop NaN, sort_values by the merge keys, ensure merge keys are same dtype
df1_reshaped = df1_reshaped.dropna(subset=["slope"]).sort_values("slope")
df2 = df2.sort_values("Optimal Cost").astype({"Optimal Cost": float})
# Perform the merge
out = (
    pd.merge_asof(
        df2,
        df1_reshaped,
        left_on="Optimal Cost",
        right_on="slope",
        by=["Name", "Segment"],
        direction="nearest"
    ).dropna()
)
print(out)
Name Segment Optimal Cost slope x y
0 Amazon 2 60.0 57.0 4.0 2.0
3 Amazon 1 115.0 120.0 2.0 0.8
And that's it!
If you're curious, here is what df1_reshaped and df2 look like prior to the merge (after the preprocessing).
>>> print(df1_reshaped)
Axis slope x y
Name Segment
Amazon 2 2 50.0 2.0 1.0
3 57.0 4.0 2.0
4 72.0 6.0 3.0
5 81.0 8.0 4.0
1 2 100.0 1.0 0.4
3 120.0 2.0 0.8
4 127.0 3.0 1.2
5 140.0 4.0 1.6
>>> print(df2)
Name Segment Optimal Cost
1 Amazon 2 60.0
2 Netflix 1 100.0
3 Netflix 2 110.0
0 Amazon 1 115.0
# Extract data and rearrange index
# Now slope and optim have the same index
slope = df1.loc[df1["Axis"] == "slope"].set_index(["Name", "Segment"]).drop(columns="Axis")
optim = df2.set_index(["Name", "Segment"]).reindex(slope.index)
# Find the closest column to the optimal cost
idx = slope.sub(optim.values).abs().idxmin(axis="columns")
>>> idx
Name Segment
Amazon 1 3 # column '3' 120 <- optimal: 115
2 3 # column '3' 57 <- optimal: 60
dtype: object
>>> df1.set_index(["Name", "Segment", "Axis"]) \
       .groupby(["Name", "Segment"], as_index=False) \
       .apply(lambda x: x[idx[x.name]]).unstack() \
       .rename_axis(columns=None).reset_index(["Name", "Segment"])
Name Segment slope x y
0 Amazon 1 120.0 2.0 0.8
1 Amazon 2 57.0 4.0 2.0

Custom expanding function with raw=False

Consider the following dataframe:
df = pd.DataFrame({
    'a': np.arange(1, 5),
    'b': np.arange(1, 5) * 2,
    'c': np.arange(1, 5) * 3
})
a b c
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
I want to calculate the cumulative sum for each row across the columns:
def expanding_func(s):
    return s.sum()

df.expanding(1, axis=1).apply(expanding_func, raw=True)
# As expected:
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0
However, if I set raw=False, expanding_func no longer works:
df.expanding(1, axis=1).apply(expanding_func, raw=False)
ValueError: Length of passed values is 3, index implies 4
The documentation says expanding_func
Must produce a single value from an ndarray input if raw=True or a single value from a Series if raw=False.
And that is exactly what I was doing. Why did expanding_func fail when raw=False?
Note: this is only a contrived example. I want to know how to write a custom rolling function, not how to calculate the cumulative sum across columns.
It seems this is a bug with pandas.
If you do:
df.iloc[:3].expanding(1, axis=1).apply(expanding_func, raw=False)
It actually works. It seems that when the windows are passed as Series, pandas checks the number of returned values against the number of rows of the dataframe for some reason (it should compare against the number of columns of the df).
A workaround is to transpose the df, apply your function, and transpose back, which seems to work. The bug only seems to affect the case where axis is set to 1.
df.T.expanding(1, axis=0).apply(expanding_func, raw=False).T
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0
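A quick way to see that the problem is specific to axis=1: along the default axis=0 the same callable behaves identically with both raw settings. A minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1, 5),
                   'b': np.arange(1, 5) * 2,
                   'c': np.arange(1, 5) * 3})

def expanding_func(s):
    return s.sum()

# Expanding down the rows: both calls produce the same column-wise
# cumulative sums, so raw=False itself is not the issue.
print(df.expanding(1).apply(expanding_func, raw=True))
print(df.expanding(1).apply(expanding_func, raw=False))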
You don't need to set raw to False or True; just do it the simple way:
df.expanding(0, axis=1).apply(expanding_func)
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0

groupby on subset of a multi index

I have a dataframe (df) with a MultiIndex consisting of three levels, say 'A', 'B', and 'C', and a column called Quantity containing floats.
What I would like to do is perform a groupby on 'A' and 'B', summing the values in Quantity. How would I do this? The standard way of working does not work because pandas does not recognize the index levels as columns, and if I use something like
df.groupby(level=0).sum()
it seems I can only select a single level. How would one go about this?
You can specify multiple levels like:
df.groupby(level=[0, 1]).sum()
#alternative
df.groupby(level=['A','B']).sum()
Or pass parameter level to sum:
df.sum(level=[0, 1])
#alternative
df.sum(level=['A','B'])
Sample:
df = pd.DataFrame({'A':[1,1,2,2,3],
                   'B':[3] * 5,
                   'C':[3,4,5,4,5],
                   'Quantity':[1.0,3,4,5,6]}).set_index(['A','B','C'])
print (df)
Quantity
A B C
1 3 3 1.0
4 3.0
2 3 5 4.0
4 5.0
3 3 5 6.0
df1 = df.groupby(level=[0, 1]).sum()
print (df1)
Quantity
A B
1 3 4.0
2 3 9.0
3 3 6.0
df1 = df.groupby(level=['A','B']).sum()
print (df1)
Quantity
A B
1 3 4.0
2 3 9.0
3 3 6.0
df1 = df.sum(level=[0, 1])
print (df1)
Quantity
A B
1 3 4.0
2 3 9.0
3 3 6.0
df1 = df.sum(level=['A','B'])
print (df1)
Quantity
A B
1 3 4.0
2 3 9.0
3 3 6.0
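On recent pandas versions you can also pass the index level names directly to groupby as keys, just as if they were columns; note that the df.sum(level=...) form shown above has since been deprecated in newer releases in favor of the groupby form. A sketch reusing the sample df from above:

# Index level names are accepted directly as groupby keys
df1 = df.groupby(['A', 'B']).sum()
print(df1)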

Groupby apply with multiple arguments

I'd like to calculate the percentile rank of each value based on which Group it belongs to. I have written the following code and was able to calculate, say, the z-score, since there is only one input. What should I do with a function that has two arguments? Thanks.
import pandas as pd
import scipy.stats as stats
import numpy as np
funZScore = lambda x: (x - x.mean()) / x.std()
funPercentile = lambda x, y: stats.percentileofscore(x[~np.isnan(x)], y)
A = pd.DataFrame({'Group' : ['A','A','A','A','B','B','B'],
                  'Value' : [4, 7, None, 6, 2, 8, 1]})
# Compute the Z-score by group
A['Z'] = A.groupby('Group')['Value'].apply(funZScore)
print(A)
Group Value Z
0 A 4.0 -1.091089
1 A 7.0 0.872872
2 A NaN NaN
3 A 6.0 0.218218
4 B 2.0 -0.440225
5 B 8.0 1.144586
6 B 1.0 -0.704361
# compute the percentile rank by group
# how to put two arguments into groupby apply?
# I hope to get something like below
Group Value Z P
0 A 4.0 -1.091089 33.33
1 A 7.0 0.872872 100
2 A NaN NaN NaN
3 A 6.0 0.218218 66.67
4 B 2.0 -0.440225 66.67
5 B 8.0 1.144586 100
6 B 1.0 -0.704361 33.33
I think you need:
d = A.groupby('Group')['Value'].apply(list).to_dict()
print (d)
{'A': [4.0, 7.0, nan, 6.0], 'B': [2.0, 8.0, 1.0]}
A['P'] = A.apply(lambda x: funPercentile(np.array(d[x['Group']]), x['Value']), axis=1)
print (A)
Group Value Z P
0 A 4.0 -1.091089 33.333333
1 A 7.0 0.872872 100.000000
2 A NaN NaN NaN
3 A 6.0 0.218218 66.666667
4 B 2.0 -0.440225 66.666667
5 B 8.0 1.144586 100.000000
6 B 1.0 -0.704361 33.333333
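Since the per-group values here are all distinct, the same result can also be obtained without the lookup dict by using a grouped rank with pct=True. A sketch (with tied values you would need to align rank's method with percentileofscore's kind to get identical numbers; the column name P2 is my own):

# Percent rank within each group, scaled to 0-100; NaN stays NaN
A['P2'] = A.groupby('Group')['Value'].rank(pct=True).mul(100)
print(A)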
