Custom expanding function with raw=False

Consider the following dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': np.arange(1, 5),
    'b': np.arange(1, 5) * 2,
    'c': np.arange(1, 5) * 3
})

   a  b   c
0  1  2   3
1  2  4   6
2  3  6   9
3  4  8  12
I want to calculate the cumulative sum for each row across the columns:
def expanding_func(s):
    return s.sum()

df.expanding(1, axis=1).apply(expanding_func, raw=True)
# As expected:
     a     b     c
0  1.0   3.0   6.0
1  2.0   6.0  12.0
2  3.0   9.0  18.0
3  4.0  12.0  24.0
However, if I set raw=False, expanding_func no longer works:
df.expanding(1, axis=1).apply(expanding_func, raw=False)
ValueError: Length of passed values is 3, index implies 4
The documentation says expanding_func:

    Must produce a single value from an ndarray input if raw=True or a single value from a Series if raw=False.
And that is exactly what I was doing. Why did expanding_func fail when raw=False?
Note: this is only a contrived example. I want to know how to write a custom rolling function, not how to calculate the cumulative sum across columns.

It seems this is a bug in pandas. If you do:

df.iloc[:3].expanding(1, axis=1).apply(expanding_func, raw=False)

it actually works. When the window is passed as a Series, pandas apparently validates the number of returned values against the number of rows of the dataframe (it should compare against the number of columns). A workaround is to transpose the dataframe, apply your function along axis=0, and transpose back; the bug only seems to bite when axis is set to 1:
df.T.expanding(1, axis=0).apply(expanding_func, raw=False).T
     a     b     c
0  1.0   3.0   6.0
1  2.0   6.0  12.0
2  3.0   9.0  18.0
3  4.0  12.0  24.0
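Another workaround (a sketch, not from the original answer): apply the expanding window one row at a time, so each row reaches your function as a plain Series and axis=1 never enters the picture.

# Hypothetical alternative: run the expanding window per row.
result = df.apply(
    lambda row: row.expanding(1).apply(expanding_func, raw=False),
    axis=1,
)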

You don't need to set raw to True or False at all; just do it the simple way:

df.expanding(0, axis=1).apply(expanding_func)
     a     b     c
0  1.0   3.0   6.0
1  2.0   6.0  12.0
2  3.0   9.0  18.0
3  4.0  12.0  24.0

Related

How do I find the max value in only specific columns in a row?

If this was my dataframe:

      a  b     c
0  12.0  5   0.1
1   9.0  7   8.0
2   1.1  2  12.9
I can use the following code to get the max values in each row... (12) (9) (12.9)
df = df.max(axis=1)
But I don't know how you would get the max values comparing only columns a & b (12, 9, 2).
Assuming one wants to consider only the columns a and b, and store the maximum value in a new column called max, one can do the following:

df['max'] = df[['a', 'b']].max(axis=1)

[Out]:
      a  b     c   max
0  12.0  5   0.1  12.0
1   9.0  7   8.0   9.0
2   1.1  2  12.9   2.0
One can also do that with a custom lambda function, as follows:

df['max'] = df[['a', 'b']].apply(lambda x: max(x), axis=1)

[Out]:
      a  b     c   max
0  12.0  5   0.1  12.0
1   9.0  7   8.0   9.0
2   1.1  2  12.9   2.0
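A vectorized alternative for exactly two columns (a sketch using NumPy, not part of the original answer):

import numpy as np

# Hypothetical variant: element-wise maximum of the two columns.
df['max'] = np.maximum(df['a'], df['b'])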
As per OP's request, if one wants to create a new column, max_of_all, to store the maximum value across all the dataframe columns, one can use the following:

df['max_of_all'] = df.max(axis=1)

[Out]:
      a  b     c   max  max_of_all
0  12.0  5   0.1  12.0        12.0
1   9.0  7   8.0   9.0         9.0
2   1.1  2  12.9   2.0        12.9
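If you also need to know which of the two columns held the row maximum, idxmax works the same way (a small sketch, not part of the original answer):

# Label of the column (a or b) that supplied each row's maximum
df['max_col'] = df[['a', 'b']].idxmax(axis=1)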

pandas - ranking with tolerance?

Is there a way to rank values in a dataframe but considering a tolerance?
Say I have the following values
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
and if I run rank:

ex.rank(method='average')
0    2.0
1    3.0
2    1.0
3    6.0
4    5.0
5    4.0
dtype: float64
But what I'd like as a result would be (with a tolerance of 0.01):

0    2.0
1    3.5
2    1.0
3    6.0
4    5.0
5    3.5
Any way to define this tolerance?
Thanks
This function may work:

def rank_with_tolerance(sr, tolerance=0.01+1e-10, method='average'):
    vals = pd.Series(sr.unique()).sort_values()
    vals.index = vals
    vals = vals.mask(vals - vals.shift(1) <= tolerance, vals.shift(1))
    return sr.map(vals).fillna(sr).rank(method=method)
It works for your given input:
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
rank_with_tolerance(ex, tolerance=0.01+1e-10, method='average')
# result:
0    2.0
1    3.5
2    1.0
3    6.0
4    5.0
5    3.5
dtype: float64
And with more complex sets it seems to work too (note that in a run of values, each within tolerance of the next, every value is merged pairwise onto its predecessor rather than the whole run collapsing into one group, as the output below shows):
ex = pd.Series([16.52,19.95,19.96, 19.95, 19.97, 19.97, 19.98])
rank_with_tolerance(ex, tolerance=0.01+1e-10, method='average')
# result:
0    1.0
1    3.0
2    3.0
3    3.0
4    5.5
5    5.5
6    7.0
dtype: float64
You could do some sort of min-max scaling, i.e.:
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
# You scale the values to be between 0 and 1
ex_scaled = (ex - min(ex)) / (max(ex) - min(ex))
# You put them on a scale from 1 to the length of your series
result = ex_scaled * len(ex) + 1
# result
0    1.335347
1    4.444109
2    1.000000
3    7.000000
4    4.969789
5    4.453172
That way you are still ranking, but values closer to each other get ranks close to each other. (Note the result spans [1, len(ex) + 1] rather than [1, len(ex)]; multiply by len(ex) - 1 instead if you want the exact range.)
You can sort the values, merge the close ones and rank on that:
s = ex.drop_duplicates().sort_values()
mapper = (s.groupby(s.diff().abs().gt(0.011).cumsum(), sort=False)
           .transform('mean')
           .reindex_like(ex)
         )
out = mapper.rank(method='average')
N.B. I used 0.011 as the threshold, as floating point arithmetic does not always provide enough precision to detect a value close to the threshold.
output:
0    2.0
1    3.5
2    1.0
3    6.0
4    5.0
5    3.5
dtype: float64
intermediate mapper:
0    16.520
1    19.955
2    16.150
3    22.770
4    20.530
5    19.955
dtype: float64
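If you need this repeatedly, the groupby/diff idea above can be wrapped into a reusable function (a sketch under the same 0.011-threshold assumption, not from the original answer):

def rank_with_tol(s, tol=0.011, method='average'):
    # Group sorted unique values whose gap to the previous value is within tol,
    # replace each group by its mean, then rank the original series on those values.
    u = s.drop_duplicates().sort_values()
    merged = u.groupby(u.diff().abs().gt(tol).cumsum(), sort=False).transform('mean')
    return s.map(dict(zip(u, merged))).rank(method=method)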

Match values in different data frame and find closest value(s)

I have a dataframe, df1:

4  Amazon  2  x  0.0  2.0  4.0  6.0  8.0
5  Amazon  2  y  0.0  1.0  2.0  3.0  4.0
df2:

Amazon   2   60
Netflix  1  100
Netflix  2  110
I am trying to compare the slope values in the axis column to the corresponding optimal cost values and extract the slope, x and y values that are closest to the optimal cost.
Expected output:

0  Amazon  1  120  2  0.8
1  Amazon  2   57  4    2
You can use pd.merge_asof to perform this type of merge quickly. However, there is some preprocessing you'll need to do to your data:

Reshape df1 to match the format of the expected output (i.e. where "slope", "x", and "y" are columns instead of rows).
Drop NaNs from the merge keys AND sort both df1 and df2 by their merge keys (this is a requirement of pd.merge_asof that we need to do explicitly). The merge keys are going to be the "slope" and "Optimal Cost" columns.
Ensure that the merge keys are of the same dtype (in this case they should both be floats, meaning we'll need to convert "Optimal Cost" to float instead of int).
Perform the merge operation.
# Reshape df1
df1_reshaped = df1.set_index(["Name", "Segment", "Axis"]).unstack(-1).stack(0)

# Drop NaN, sort by the merge keys, ensure merge keys are same dtype
df1_reshaped = df1_reshaped.dropna(subset=["slope"]).sort_values("slope")
df2 = df2.sort_values("Optimal Cost").astype({"Optimal Cost": float})

# Perform the merge
out = (
    pd.merge_asof(
        df2,
        df1_reshaped,
        left_on="Optimal Cost",
        right_on="slope",
        by=["Name", "Segment"],
        direction="nearest"
    ).dropna()
)
print(out)
     Name  Segment  Optimal Cost  slope    x    y
0  Amazon        2          60.0   57.0  4.0  2.0
3  Amazon        1         115.0  120.0  2.0  0.8
And that's it!
If you're curious, here are what df1_reshaped and df2 look like prior to the merge (after the preprocessing).
>>> print(df1_reshaped)
Axis              slope    x    y
Name   Segment
Amazon 2       2   50.0  2.0  1.0
               3   57.0  4.0  2.0
               4   72.0  6.0  3.0
               5   81.0  8.0  4.0
       1       2  100.0  1.0  0.4
               3  120.0  2.0  0.8
               4  127.0  3.0  1.2
               5  140.0  4.0  1.6

>>> print(df2)
      Name  Segment  Optimal Cost
1   Amazon        2          60.0
2  Netflix        1         100.0
3  Netflix        2         110.0
0   Amazon        1         115.0
An alternative approach without merge_asof:

# Extract the slope rows and the optimal costs, and give them the same index
slope = df1.loc[df1["Axis"] == "slope"].set_index(["Name", "Segment"]).drop(columns="Axis")
optim = df2.set_index(["Name", "Segment"]).reindex(slope.index)

# Find the column whose slope is closest to the optimal cost
idx = slope.sub(optim.values).abs().idxmin(axis="columns")
>>> idx
Name    Segment
Amazon  1          3    # column '3': 120 <- optimal: 115
        2          3    # column '3':  57 <- optimal: 60
dtype: object
>>> df1.set_index(["Name", "Segment", "Axis"]) \
...    .groupby(["Name", "Segment"], as_index=False) \
...    .apply(lambda x: x[idx[x.name]]).unstack() \
...    .rename_axis(columns=None).reset_index(["Name", "Segment"])
     Name  Segment  slope    x    y
0  Amazon        1  120.0  2.0  0.8
1  Amazon        2   57.0  4.0  2.0

Pandas: how to join two dataframes combinatorially [duplicate]

This question already has answers here:
cartesian product in pandas (13 answers)
Closed 4 years ago.
I have two dataframes that I would like to combine combinatorially (i.e. join each row of one df to each row of the other). I can do this by merging on a 'key' column, but my solution is clearly cumbersome. I'm looking for a more straightforward, even pythonesque, way of handling this operation. Any suggestions?
MWE:
fred = pd.DataFrame({'A':[1., 4.],'B':[2., 5.], 'C':[3., 6.]})
print(fred)
     A    B    C
0  1.0  2.0  3.0
1  4.0  5.0  6.0
jim = pd.DataFrame({'one':['a', 'c'],'two':['b', 'd']})
print(jim)
  one two
0   a   b
1   c   d
fred['key'] = [1,2]
jim1 = jim.copy()
jim1['key'] = 1
jim2 = jim.copy()
jim2['key'] = 2
jim3 = jim1.append(jim2)
jack = pd.merge(fred, jim3, on='key').drop(['key'], axis=1)
print(jack)
     A    B    C one two
0  1.0  2.0  3.0   a   b
1  1.0  2.0  3.0   c   d
2  4.0  5.0  6.0   a   b
3  4.0  5.0  6.0   c   d
You can join every row of fred with every row of jim by merging on a key column which is equal to the same value (say, 1) for every row:
In [16]: pd.merge(fred.assign(key=1), jim.assign(key=1), on='key').drop('key', axis=1)
Out[16]:
     A    B    C one two
0  1.0  2.0  3.0   a   b
1  1.0  2.0  3.0   c   d
2  4.0  5.0  6.0   a   b
3  4.0  5.0  6.0   c   d
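Since pandas 1.2 there is also a built-in cross join, which makes the helper key unnecessary (a minimal sketch):

# Cartesian product via the dedicated cross-join merge (pandas >= 1.2)
jack = fred.merge(jim, how='cross')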
Are you looking for the cartesian product of the two dataframes, like a cross join?
It is answered here.

Pandas Rank Multiple Columns for huge dataset using Threadpool

I need to rank each column of the dataframe. I'm currently using the code below:

for x in range(1, len(cols)):
    data[cols[x]] = data[cols[x]].rank(ascending=0)

This works for a small dataset, but I have more than 50,000 columns and 20,000 rows. Is there a way I can make this faster with a thread pool? I tried the code below, but it didn't work: it returns an empty result.
cols = rankDset.columns.tolist()

def rank_columns(c):
    rankDset[c] = rankDset[c].rank(ascending=0)

def parallelDataframe(df, func):
    pool = Pool(8)
    pool.map(func, cols)
    pool.close()
    pool.join()

parallelDataframe(rankDset, rank_columns)
You should be able to rank every column in a single vectorized call with pd.DataFrame.rank (pass ascending=False to match the ordering in your loop):

df.rank()
From Docs:

    Compute numerical data ranks (1 through n) along axis.
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Index to direct ranking.
Consider the dataframe df:

np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    A=np.random.choice(np.arange(10), 5, False),
    B=np.random.choice(np.arange(10), 5, False),
    C=np.random.choice(np.arange(10), 5, False),
    D=np.random.choice(np.arange(10), 5, False),
))
df
   A  B  C  D
0  9  1  6  0
1  4  3  8  2
2  5  5  9  6
3  1  9  7  1
4  7  4  3  9
Then ranking produces
df.rank()
     A    B    C    D
0  5.0  1.0  2.0  1.0
1  2.0  2.0  4.0  3.0
2  3.0  4.0  5.0  4.0
3  1.0  5.0  3.0  2.0
4  4.0  3.0  1.0  5.0
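As for why the original attempt came back empty: if Pool comes from multiprocessing, each worker process mutates its own copy of rankDset, so the parent process never sees the ranked columns. If you still want parallelism on top of the vectorized call above, have the workers return their results instead of assigning to a shared frame. A sketch (df stands for your frame; the worker count of 8 is arbitrary, and the actual speedup depends on how much of rank releases the GIL):

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def rank_column(col):
    # Return the ranked column rather than mutating shared state
    return df[col].rank(ascending=False)

with ThreadPoolExecutor(max_workers=8) as executor:
    ranked = pd.concat(executor.map(rank_column, df.columns), axis=1)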
