groupby transform with lambda - python

I have the following table:
df = pd.DataFrame({"A":['CH','CH','NU','NU','J'],
"B":['US','AU','Q','US','Q'],
"TOTAL":[10,13,3,1,18]})
I wish to get the ratio of each row's TOTAL with respect to the total of its A group. What I do is:
df['sum'] = df.groupby(['A'])['TOTAL'].transform(np.sum)
df['ratio'] = df['TOTAL']/df['sum']*100
Question: how can one achieve this with a lambda (or is there a better way)?

If you want to use a lambda you can do the division inside transform:
df['ratio'] = df.groupby('A')['TOTAL'].transform(lambda x: x / x.sum() * 100)
Output:
A B TOTAL sum ratio
0 CH US 10 23 43.478261
1 CH AU 13 23 56.521739
2 NU Q 3 4 75.000000
3 NU US 1 4 25.000000
4 J Q 18 18 100.000000
But this is slower (the lambda is applied group by group in Python). If I were you, I'd choose your code over this one.
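A middle ground, as a minimal sketch assuming the same df as in the question: keep the two-step approach but pass the string 'sum' to transform, so pandas can use its built-in grouped sum rather than calling a Python function per group.
import pandas as pd

df = pd.DataFrame({"A": ['CH', 'CH', 'NU', 'NU', 'J'],
                   "B": ['US', 'AU', 'Q', 'US', 'Q'],
                   "TOTAL": [10, 13, 3, 1, 18]})

# 'sum' dispatches to the optimized grouped aggregation
group_total = df.groupby('A')['TOTAL'].transform('sum')
df['ratio'] = df['TOTAL'] / group_total * 100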


Don't want to use a for loop in Pandas. How can I use apply in this case?

I know that for loops are not a good thing in Pandas and apply could be better, but I found it hard to use apply for my question.
import pandas as pd
import numpy as np

data = {'A': [1, 1, 1, 2, 2], 'B': [2018, 2019, 2020, 2019, 2020],
        'PR': [12, 10, 0, 24, 20], 'WP': [300, 0, 0, 300, 0],
        'BD': [6, 5, 0, 2, 1], 'i': [1, 2, 1, 1, 2],
        'r': [0.5, 0.25, 0, 0.5, 0.25]}
df = pd.DataFrame(data)
df['X'] = 0
df['Y'] = 0
df['Z'] = 0
My aim is:
Divide the df to two groups, according to A.
For each group, calculate the X Y and Z
X = (Z in last year + PR in current year) * i in this year
Y = Z in last year + WP movement from last year to this year + BD movement from last year to this year + X in this year
Z = Y in this year * r in this year.
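For example, for the first row of group A = 1 (year 2018), taking last year's Z, WP and BD as 0: X = (0 + 12) * 1 = 12, Y = 0 + (300 - 0) + (6 - 0) + 12 = 318, and Z = 318 * 0.5 = 159.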
The following is my code; it works well, but I don't want to use a for loop. Are there any better methods?
# divide the df to two groups
sub_df = [df[df['A'].isin([i])] for i in np.unique(df['A'])]
a = []
for df in sub_df:
    df = df.copy()
    df.loc[-1] = [0] * df.shape[1]  # add a 0 row to calculate the first year
    df.sort_index(inplace=True)
    df.reset_index(drop=True, inplace=True)
    for n in range(1, df.shape[0]):
        df.loc[n, 'X'] = (df.loc[n-1, 'Z'] + df.loc[n, 'PR']) * df.loc[n, 'i']
        df.loc[n, 'Y'] = df.loc[n-1, 'Z'] + df.loc[n, 'WP'] - df.loc[n-1, 'WP'] + df.loc[n, 'BD'] - df.loc[n-1, 'BD'] + df.loc[n, 'X']
        df.loc[n, 'Z'] = df.loc[n, 'Y'] * df.loc[n, 'r']
    a.append(df[1:])
b = pd.concat(a)
b
I don't know of a method that avoids a for loop entirely, but I managed to simplify the code a little bit:
def get_df(df):
    X, Y, Z = [], [], []
    prev_Z, prev_WP, prev_BD = 0, 0, 0
    for pr, i, wp, bd, r in zip(df["PR"], df["i"], df["WP"], df["BD"], df["r"]):
        X.append((prev_Z + pr) * i)
        Y.append(prev_Z + wp - prev_WP + bd - prev_BD + X[-1])
        Z.append(Y[-1] * r)
        prev_Z, prev_WP, prev_BD = Z[-1], wp, bd
    df["X"] = X
    df["Y"] = Y
    df["Z"] = Z
    return df
out = df.groupby("A").apply(get_df)
print(out)
Prints:
A B PR WP BD i r X Y Z
0 1 2018 12 300 6 1 0.50 12.0 318.0 159.0
1 1 2019 10 0 5 2 0.25 338.0 196.0 49.0
2 1 2020 0 0 0 1 0.00 49.0 93.0 0.0
3 2 2019 24 300 2 1 0.50 24.0 326.0 163.0
4 2 2020 20 0 1 2 0.25 366.0 228.0 57.0
According to my timing, your code takes about 0.0048 seconds to complete and my version about 0.0019 seconds, so roughly 2.5x faster.
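Most of the speed-up comes from replacing the repeated .loc scalar indexing with plain Python iteration over the column values via zip, while groupby handles the per-group split.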

Length of list: len(list) returning the wrong value in Python

It might sound trivial, but I am surprised by the output. Basically, I am calculating y = a*x + b for given a, b and x. With the code below I am able to get the desired result for y, which is a list of 20 values.
But when I check the length of the list, I get 1 in return, and the range is (0, 1), which is weird as I was expecting it to be 20.
Am I making any mistake here?
a = 10
b = 0
x = df['x']
print(x)
0 0.000000
1 0.052632
2 0.105263
3 0.157895
4 0.210526
5 0.263158
6 0.315789
7 0.368421
8 0.421053
9 0.473684
10 0.526316
11 0.578947
12 0.631579
13 0.684211
14 0.736842
15 0.789474
16 0.842105
17 0.894737
18 0.947368
19 1.000000
y_new = []
for i in x:
    y = a*x + b
    y_new.append(y)
len(y_new)
Output: 1
print(y_new)
[0 0.000000
1 0.526316
2 1.052632
3 1.578947
4 2.105263
5 2.631579
6 3.157895
7 3.684211
8 4.210526
9 4.736842
10 5.263158
11 5.789474
12 6.315789
13 6.842105
14 7.368421
15 7.894737
16 8.421053
17 8.947368
18 9.473684
19 10.000000
Name: x, dtype: float64]
I would propose two solutions:
The first solution: convert your column df['x'] into a list with df['x'].tolist(), re-run your code, and replace a*x + b with a*i + b.
The second solution (which is what I would do): convert df['x'] into an array with x = np.array(df['x']). By doing this you can use array broadcasting.
So, your code will simply be :
x = np.array(df['x'])
y = a*x + b
This should give you the desired output.
I hope this is helpful.
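If you keep the loop, here is a minimal sketch of the first suggestion with the a*i + b fix applied, assuming x = df['x'] is the Series printed above:
a = 10
b = 0
y_new = []
for i in x:                    # iterate over the 20 scalar values
    y_new.append(a * i + b)    # a*i + b appends one number per element
print(len(y_new))              # 20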
With the code below, I get a length of 20 for the list y_new. Are you sure you are printing the right value? According to this post, df['x'] returns a pandas Series, so df['x'] is equivalent to pd.Series(...).
df['x'] — index a column named 'x'; returns a pd.Series.
import pandas as pd

a = 10
b = 0
x = pd.Series(data=[0.000000, 0.052632, 0.105263, 0.157895, 0.210526, 0.263158, 0.315789, 0.368421, 0.421053, 0.473684,
                    0.526316, 0.578947, 0.631579, 0.684211, 0.736842, 0.789474, 0.842105, 0.894737, 0.947368, 1.000000])
y_new = []
for i in x:
    y = a*x + b
    y_new.append(y)
print("y_new length: " + str(len(y_new)))
print("y_new length: " + str(len(y_new)) )
Output:
y_new length: 20

How to loop over a dataframe with an increment factor based on a particular column value

The dataframe I am working with looks like this:
vid2 COS fsim FWeight
0 -_aaMGK6GGw_57_61 2 0.253792 0.750000
1 -_aaMGK6GGw_57_61 2 0.192565 0.250000
2 -_hbPLsZvvo_5_8 2 0.562707 0.333333
3 -_hbPLsZvvo_5_8 2 0.179969 0.666667
4 -_hbPLsZvvo_18_25 1 0.275962 0.714286
Here,
the features have the following meanings:
FWeight - weight of each fragment (or row)
fsim - similarity score between the two columns cap1 and cap2
The weighted formula is the weighted average of fsim by FWeight, i.e. vid_score = sum(fsim * FWeight) / sum(FWeight).
For example,
For vid2 "-_aaMGK6GGw_57_61", COS = 2.
Thus, the two rows with this vid2 come under it:
fsim FWeight
0 0.253792 0.750000
1 0.192565 0.250000
The calculated value vid_score needs to be
vid_score(1st video) = (fsim[0] * FWeight[0] + fsim[1] * FWeight[1])/(FWeight[0] + FWeight[1])
The expected output value vid_score for vid2 = -_aaMGK6GGw_57_61 is
((0.750000) * (0.253792) + (0.250000) * (0.192565)) / (0.750000 + 0.250000)
= 0.238485 (final value; the weights here already sum to 1)
For some videos, this COS = 1, 2, 3, 4, 5, ...
Thus this needs to be dynamic
I am trying to calculate the weighted similarity score for each video ID (vid2 here). However, each video has a varying number of captions and weights: some have 2, some 1, some 3, etc. This number of segments and captions is stored in the feature COS (that is, count of segments).
I want to iterate through the dataframe so that the score for each video is the weighted average of its fsim (fragment similarity) values, but the number of rows to consume per video is not constant.
I have tried this code, but I am not able to iterate dynamically with the step size being COS instead of a constant value:
vems_score = 0.0
video_scores = []
for i, row in merged.iterrows():
    vid_score = 0.0
    total_weight = 0.0
    for j in range(row['COS']):
        total_weight = total_weight + row['FWeight']
        vid_score = vid_score + (row['FWeight'] * row['fsim'])
    i = i + row['COS']
    vid_score = vid_score / total_weight
    video_scores.append(vid_score)
print(video_scores)
Here is my solution, which you can modify/optimize to your needs.
import pandas as pd, numpy as np

def computeSim():
    vid = [1, 1, 2, 2, 3]
    cos = [2, 2, 2, 2, 1]
    fsim = [0.25, .19, .56, .17, .27]
    weight = [.75, .25, .33, .66, .71]
    df = pd.DataFrame({'vid': vid, 'cos': cos, 'fsim': fsim, 'fw': weight})
    print(df)
    df2 = df.groupby('vid')
    similarity = []
    for group in df2:
        similarity.append(np.sum(group[1]['fsim'] * group[1]['fw']) / np.sum(group[1]['fw']))
    return similarity
output:
0.235
0.30000000000000004
0.27
Solution
Try this with your data. I assume that you stored the dataframe as df.
df['Prod'] = df['fsim']*df['FWeight']
grp = df.groupby(['vid2', 'COS'])
result = grp['Prod'].sum()/grp['FWeight'].sum()
print(result)
Output with your data (Dummy Data B, defined below):
vid2 COS
-_aaMGK6GGw_57_61 2 0.238485
-_hbPLsZvvo_18_25 1 0.275962
-_hbPLsZvvo_5_8 2 0.307548
dtype: float64
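Note that grouping on ['vid2', 'COS'] gives the same groups as grouping on vid2 alone, since COS (the count of segments) is constant within each vid2; it just keeps COS visible in the result index.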
Dummy Data: A
I made the following dummy data to test a few aspects of the logic.
df = pd.DataFrame({'vid2': [1,1,2,5,2,6,7,4,8,7,6,2],
'COS': [2,2,3,1,3,2,2,1,1,2,2,3],
'fsim': np.random.rand(12),
'FWeight': np.random.rand(12)})
df['Prod'] = df['fsim']*df['FWeight']
print(df)
# Groupby and apply formula
grp = df.groupby(['vid2', 'COS'])
result = grp['Prod'].sum()/grp['FWeight'].sum()
print(result)
Output:
vid2 COS
1 2 0.405734
2 3 0.535873
4 1 0.534456
5 1 0.346937
6 2 0.369810
7 2 0.479250
8 1 0.065854
dtype: float64
Dummy Data: B (OP Provided)
This is your dummy data. I made this script so anyone could easily run it and load the data as a dataframe.
import pandas as pd
from io import StringIO
s = """
vid2 COS fsim FWeight
0 -_aaMGK6GGw_57_61 2 0.253792 0.750000
1 -_aaMGK6GGw_57_61 2 0.192565 0.250000
2 -_hbPLsZvvo_5_8 2 0.562707 0.333333
3 -_hbPLsZvvo_5_8 2 0.179969 0.666667
4 -_hbPLsZvvo_18_25 1 0.275962 0.714286
"""
df = pd.read_csv(StringIO(s), sep='\s+')
#print(df)
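Because the header row has one fewer field than the data rows, read_csv uses the leading column of row numbers as the index.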

rolling mean with a moving window

My dataframe has a daily price column and a window size column :
df = pd.DataFrame(columns = ['price', 'window'],
data = [[100, 1],[120, 2], [115, 2], [116, 2], [100, 4]])
df
price window
0 100 1
1 120 2
2 115 2
3 116 2
4 100 4
I would like to compute the rolling mean of price for each row, using the window size given in that row's window column.
The result would be this :
df
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75
I can't find an elegant way to do it with apply, and I refuse to loop over each row of my DataFrame...
The best solutions, in terms of raw speed and complexity, are based on the idea of a summed-area table; the problem can be considered as a one-dimensional version of it. Below you can find several approaches, ranked from best to worst.
Numpy + Linear complexity
import numpy as np

size = len(df['price'])
price = np.zeros(size + 1)
price[1:] = df['price'].values.cumsum()
window = np.clip(np.arange(size) - (df['window'].values - 1), 0, None)
df['rolling_mean_price'] = (price[1:] - price[window]) / df['window'].values
print(df)
Output
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75
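The trick here: price holds the cumulative sum with a leading zero, so the sum of the last w prices ending at row i is price[i + 1] - price[i + 1 - w]. For row 4 with window 4 that is 551 - 100 = 451, and 451 / 4 = 112.75, matching the output.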
Loopy + Linear complexity
price = df['price'].values.cumsum()
df['rolling_mean_price'] = [(price[i] - float((i - w) > -1) * price[i-w]) / w for i, w in enumerate(df['window'])]
Loopy + Quadratic complexity
price = df['price'].values
df['rolling_mean_price'] = [price[i - (w - 1):i + 1].mean() for i, w in enumerate(df['window'])]
I would not recommend this approach using pandas.DataFrame.apply() (reasons described here), but if you insist on it, here is one solution:
df['rolling_mean_price'] = df.apply(
lambda row: df.rolling(row.window).price.mean().iloc[row.name], axis=1)
The output looks like this:
>>> print(df)
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75
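Note that row.name (the row's index label) is fed to .iloc as a position, so this assumes the DataFrame keeps its default RangeIndex.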

Fill values from the nearest neighbor row by comparing another column in Pandas

I have a dataframe like this:
azimuth id
15 100
15 1
15 100
150 2
150 100
240 3
240 100
240 100
350 100
What I need is to replace the 100 values with the id from the row whose azimuth is the closest:
Desired output:
azimuth id
15 1
15 1
15 1
150 2
150 2
240 3
240 3
240 3
350 1
350 is near to 15 because azimuth is circular (an angle representation); the difference is 25.
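In general, the circular difference between two azimuths a and b is min(|a - b|, 360 - |a - b|); for 350 and 15 that is min(335, 25) = 25.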
What I have:
def mysubstitution(x):
    for i in x.index[x['id'] == 100]:
        i = int(i)
        diff = (x['azimuth'] - x.loc[i, 'azimuth']).abs()
        for ind in diff.index:
            if diff[ind] > 180:
                diff[ind] = 360 - diff[ind]
            else:
                pass
        exclude = [y for y in x.index if y not in x.index[x['id'] == 100]]
        closer_idx = diff[exclude]
        closer_df = pd.DataFrame(closer_idx)
        sorted_df = closer_df.sort_values('azimuth', ascending=True)
        try:
            a = sorted_df.index[0]
            x.loc[i, 'id'] = x.loc[a, 'id']
        except Exception as a:
            print(a)
    return x
This works OK most of the time, but I guess there is a simpler solution.
Thanks in advance.
I tried to implement the functionality in two steps. First, I built a grouped dataframe that holds, for each azimuth, its id value (for values other than 100).
Then, using this grouped dataframe, I implemented the replaceAzimuth function, which takes each row of the dataframe and first checks whether a real id already exists for that azimuth. If so, it uses it directly; otherwise, it replaces the id value with the id of the closest azimuth in the grouped dataframe.
Here is the implementation:
df = pd.DataFrame([[15, 100], [15, 1], [15, 100], [150, 2], [150, 100], [240, 3], [240, 100], [240, 100], [350, 100]],
                  columns=['azimuth', 'id'])
df_non100 = df[df['id'] != 100]
df_grouped = df_non100.groupby(['azimuth'])['id'].min().reset_index()

def replaceAzimuth(df_grouped, id_val):
    real_id = df_grouped[df_grouped['azimuth'] == id_val['azimuth']]['id']
    if real_id.size == 0:
        df_diff = df_grouped
        df_diff['azimuth'] = df_diff['azimuth'].apply(lambda x: min(abs(id_val['azimuth'] - x), (360 - id_val['azimuth'] + x)))
        id_val['id'] = df_grouped.iloc[df_diff['azimuth'].idxmin()]['id']
    else:
        id_val['id'] = real_id
    return id_val

df = df.apply(lambda x: replaceAzimuth(df_grouped, x), axis=1)
df
For me, the code seems to give the output you have shown, but I am not sure it will work in all cases!
First set all ids to nan if they are 100.
df.id = np.where(df.id==100, np.nan, df.id)
Then calculate the angle diff pairwise and find the closest ID to fill the nans.
df.id = df.id.combine_first(
    pd.DataFrame(np.abs(((df.azimuth.values[:, None] - df.azimuth.values) + 180) % 360 - 180))
      .pipe(np.argsort)
      .applymap(lambda x: df.id.iloc[x])
      .apply(lambda x: x.dropna().iloc[0], axis=1)
)
df
azimuth id
0 15 1.0
1 15 1.0
2 15 1.0
3 150 2.0
4 150 2.0
5 240 3.0
6 240 3.0
7 240 3.0
8 350 1.0
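combine_first only fills the positions that were set to NaN in the first step, so rows that already had a real id keep it; the pairwise matrix of circular differences is argsorted per row, and each NaN row then picks up the id of the nearest azimuth that has one.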
