Python dataframe groupby binning statistics - python

For each unique value of "acat", I want to count how many occurrences there are of each "data" category (call these counts "bins"), and then calculate the mean and skew of "bins".
possible values of data = 1,2,3,4,5
df = pd.DataFrame({'acat': [1, 1, 2, 3, 1, 3],
                   'data': [1, 1, 2, 1, 3, 1]})
df
Out[45]:
acat data
0 1 1
1 1 1
2 2 2
3 3 1
4 1 3
5 3 1
for acat = 1:
    bins = (2, 0, 1, 0, 0)
    average = (2 + 0 + 1 + 0 + 0) / 5 = 0.6
for acat = 2:
    bins = (0, 1, 0, 0, 0)
    average = (0 + 1 + 0 + 0 + 0) / 5 = 0.2
for acat = 3:
    bins = (2, 0, 0, 0, 0)
    average = (2 + 0 + 0 + 0 + 0) / 5 = 0.4
bin_average_col
0.6
0.6
0.2
0.4
0.6
0.4
Also I would like a bin_skew_col.
I have a solution that uses crosstab, but it exhausts my PC's memory when the number of unique acat values is large.
I have tried extensively with groupby and transform, but this is beyond me!
Many thanks in advance.
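A sketch of one possible approach (not from the original thread): build each group's bin counts with value_counts reindexed over the five possible data values, take the mean and skew of those counts, and join the results back onto the original rows. The column names bin_average and bin_skew are my own; bins.skew() is pandas' sample skew, so swap in scipy.stats.skew if a different definition is needed.

```python
import pandas as pd

df = pd.DataFrame({'acat': [1, 1, 2, 3, 1, 3],
                   'data': [1, 1, 2, 1, 3, 1]})

def bin_stats(s):
    # counts of each possible data value 1..5, keeping empty bins as 0
    bins = s.value_counts().reindex(range(1, 6), fill_value=0)
    return pd.Series({'bin_average': bins.mean(),
                      'bin_skew': bins.skew()})

# one row of stats per acat, then broadcast back onto the original rows
stats = df.groupby('acat')['data'].apply(bin_stats).unstack()
df = df.join(stats, on='acat')
print(df[['acat', 'data', 'bin_average']])
```

Each group only ever materialises a length-5 Series, so memory stays proportional to the number of rows rather than to a full acat-by-data crosstab.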

Related

change column value with arithmetic sequences using df.loc in pandas

suppose I have following dataframe :
data = {"age":[2,3,2,5,9,12,20,43,55,60],'alpha' : [0,0,0,0,0,0,0,0,0,0]}
df = pd.DataFrame(data)
I want to change the value of column alpha based on column age using df.loc and an arithmetic sequence, but I got a syntax error:
df.loc[((df.age <=4)) , "alpha"] = ".4"
df.loc[((df.age >= 5)) & ((df.age <= 20)), "alpha"] = 0.4 + (1 - 0.4)*((df$age - 4)/(20 - 4))
df.loc[((df.age > 20)) , "alpha"] = "1"
thank you in advance.
Reference the age column using a . not a $
df.loc[((df.age >= 5)) & ((df.age <= 20)), "alpha"] = 0.4 + (1 - 0.4)*((df.age - 4)/(20 - 4))
Instead of multiple .loc assignments you can combine all conditions at once using chained np.where clauses:
df['alpha'] = np.where(df.age <= 4, ".4",
              np.where((df.age >= 5) & (df.age <= 20),
                       0.4 + (1 - 0.4) * ((df.age - 4) / (20 - 4)),
                       np.where(df.age > 20, "1", df.alpha)))
print(df)
age alpha
0 2 .4
1 3 .4
2 2 .4
3 5 0.4375
4 9 0.5875
5 12 0.7
6 20 1.0
7 43 1
8 55 1
9 60 1
Besides the syntax error (due to the $), to reduce visual noise I would go for numpy.select:
import numpy as np

conditions = [df["age"].le(4),
              df["age"].gt(4) & df["age"].le(20),
              df["age"].gt(20)]

values = [".4", 0.4 + (1 - 0.4) * ((df["age"] - 4) / (20 - 4)), 1]

df["alpha"] = np.select(condlist=conditions, choicelist=values)
Output:
print(df)
age alpha
0 2 .4
1 3 .4
2 2 .4
3 5 0.4375
4 9 0.5875
5 12 0.7
6 20 1.0
7 43 1
8 55 1
9 60 1

Don't want to use for loop in Pandas. How can I use Apply in this case?

I know that for loops are not ideal in Pandas and that apply can be better, but I found it hard to use apply for my question.
data = {'A':[1,1,1,2,2], 'B':[2018,2019,2020,2019,2020],'PR':[12,10,0,24,20],'WP':[300,0,0,300,0],'BD':[6,5,0,2,1],'i':[1,2,1,1,2],'r':[0.5,0.25,0,0.5,0.25]}
df = pd.DataFrame(data)
df['X'] = 0
df['Y'] = 0
df['Z'] = 0
The original dataframe is the one built above (it was shown as an image in the original post), with X, Y and Z initialised to zero.
My aim is:
Divide the df to two groups, according to A.
For each group, calculate the X Y and Z
X = (Z in last year + PR in current year) * i in this year
Y = Z in last year + WP movement from last year to this year + BD movement from last year to this year + X in this year
Z = Y in this year * r in this year.
The following is my code, it works well. But I don't want to use for loop. Are there any better methods?
# divide the df into two groups
sub_df = [df[df['A'].isin([i])] for i in np.unique(df['A'])]
a = []
for df in sub_df:
    df = df.copy()
    df.loc[-1] = [0] * df.shape[1]  # add a zero row to seed the first year
    df.sort_index(inplace=True)
    df.reset_index(drop=True, inplace=True)
    for n in range(1, df.shape[0]):
        df.loc[n, 'X'] = (df.loc[n - 1, 'Z'] + df.loc[n, 'PR']) * df.loc[n, 'i']
        df.loc[n, 'Y'] = df.loc[n - 1, 'Z'] + df.loc[n, 'WP'] - df.loc[n - 1, 'WP'] + df.loc[n, 'BD'] - df.loc[n - 1, 'BD'] + df.loc[n, 'X']
        df.loc[n, 'Z'] = df.loc[n, 'Y'] * df.loc[n, 'r']
    a.append(df[1:])
b = pd.concat(a)
b
I don't know of a method that avoids the for-loop entirely, but I managed to simplify the code a little bit:
def get_df(df):
    X, Y, Z = [], [], []
    prev_Z, prev_WP, prev_BD = 0, 0, 0
    for pr, i, wp, bd, r in zip(df["PR"], df["i"], df["WP"], df["BD"], df["r"]):
        X.append((prev_Z + pr) * i)
        Y.append(prev_Z + wp - prev_WP + bd - prev_BD + X[-1])
        Z.append(Y[-1] * r)
        prev_Z, prev_WP, prev_BD = Z[-1], wp, bd
    df["X"] = X
    df["Y"] = Y
    df["Z"] = Z
    return df

out = df.groupby("A").apply(get_df)
print(out)
Prints:
A B PR WP BD i r X Y Z
0 1 2018 12 300 6 1 0.50 12.0 318.0 159.0
1 1 2019 10 0 5 2 0.25 338.0 196.0 49.0
2 1 2020 0 0 0 1 0.00 49.0 93.0 0.0
3 2 2019 24 300 2 1 0.50 24.0 326.0 163.0
4 2 2020 20 0 1 2 0.25 366.0 228.0 57.0
According to my timing, the original code takes about 0.0048 seconds to complete and this version about 0.0019 seconds, so roughly 2.5x faster.

Creation Variations of Pandas DataFrame with Column Dependencies

I want to be able to expand my DataFrame to incorporate other scenarios. For example, for a DataFrame capturing active users per company, I want to add a scenario where active users increase but do not exceed the total user count.
Example input and output were provided as images (not reproduced here); the input columns (company, contract, total_users, active_users, inactive_users) can be inferred from the answer's output below.
I tried using a loop but quite inefficiently, yielding odd results:
while df[df['active_users'] + add_users <= df['total_users']].any():
    df[(df['active_users'] + add_users) <= df["total_users"]]['active_users'] = (df['active_users'] + add_users).astype(int)
    add_users += 1
Use Index.repeat with DataFrame.loc and for counters use GroupBy.cumcount:
df1 = df.loc[df.index.repeat(df['inactive_users'] + 1)]
df1['inactive_users'] = df1.groupby(level=0).cumcount(ascending=False)
s = df1.groupby(level=0).cumcount()
df1['active_users'] += s
df1['company'] = (df1['company'] + ' + ' + s.astype(str).replace('0','')).str.strip(' +')
print (df1)
company contract total_users active_users inactive_users
0 A 10000 10 7 3
0 A + 1 10000 10 8 2
0 A + 2 10000 10 9 1
0 A + 3 10000 10 10 0
1 B 7500 5 4 1
1 B + 1 7500 5 5 0
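Since the example input was only given as an image, here is a runnable sketch of the answer's approach, with the input frame reconstructed from the printed output above (the reconstruction is mine, not the original post's data):

```python
import pandas as pd

# input reconstructed from the answer's printed output
df = pd.DataFrame({'company': ['A', 'B'],
                   'contract': [10000, 7500],
                   'total_users': [10, 5],
                   'active_users': [7, 4],
                   'inactive_users': [3, 1]})

# one row per scenario: repeat each company row inactive_users + 1 times
df1 = df.loc[df.index.repeat(df['inactive_users'] + 1)]
# count down the remaining inactive users within each original row
df1['inactive_users'] = df1.groupby(level=0).cumcount(ascending=False)
s = df1.groupby(level=0).cumcount()
df1['active_users'] += s
# label the added-user scenarios, leaving the base row's name unchanged
df1['company'] = (df1['company'] + ' + ' + s.astype(str).replace('0', '')).str.strip(' +')
print(df1)
```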

pandas product of a column with its index groupby

I am working with a dataframe, and had to do a groupby in order to make some operations on my data.
This is an example of my Dataframe:
I SI deltas
1 10 0.1
1 14 0.1
2 10 0.1
2 18 0.3
1 17 0.05
2 30 0.3
1 10 0.4
1 14 0.2
2 10 0.1
2 18 0.2
1 17 0.15
Now, for each I, I count the relative frequency of the SI in this way:
results = df.groupby(['I', 'SI'])[['deltas']].sum()
#for each I, we sum all the weights (Deltas)
denom = results.groupby('I')['deltas'].sum()
#for each I, we divide each deltas by the sum, getting them normalized to one
results.deltas = results.deltas / denom
So my Dataframe now looks like this:
I = 1
deltas
SI = 10 0.5
SI = 14 0.3
SI = 17 0.2
I = 2
deltas
SI = 10 0.2
SI = 18 0.5
SI = 30 0.3
....
What I need to do is to print for each I the sum of deltas times their relative SI:
I = 1 sum = 0.5 * 10 + 0.3*14 + 0.2*17 = 12.6
I = 2 sum = 0.2*10 + 0.5*18 + 0.3*30 = 20
But since now I am working with a dataframe where the indices are I and SI, I do not know how to use them. I tried this code:
for idx2, j in enumerate(results.index.get_level_values(0).unique()):
    # print(results.loc[j])
    f.write("%d\t" % (j) + results.loc[j].to_string(index=False) + '\n')
but I am not sure how I should proceed to get the index values.
Let's assume you have an input dataframe df following your initial transformations. If SI is your index, elevate it to a column via df = df.reset_index() as an initial step.
I SI weight
0 1 10 0.5
1 1 14 0.3
2 1 17 0.2
3 2 10 0.2
4 2 18 0.5
5 2 30 0.3
You can then calculate the product of SI and weight, then use GroupBy + sum:
res = df.assign(prod=df['SI']*df['weight'])\
.groupby('I')['prod'].sum().reset_index()
print(res)
I prod
0 1 12.6
1 2 20.0
For a single dataframe in isolation, you can use np.dot for the dot product.
s = pd.Series([0.5, 0.3, 0.2], index=[10, 14, 17])
s.index.name = 'SI'
res = np.dot(s.index, s) # 12.6
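For the MultiIndex frame from the question (indexed by I and SI), the same per-group dot product can also be taken straight from the index level, skipping reset_index. A sketch, with the normalized results frame rebuilt by hand:

```python
import pandas as pd

# normalized weights per (I, SI), as computed in the question
results = pd.DataFrame(
    {'deltas': [0.5, 0.3, 0.2, 0.2, 0.5, 0.3]},
    index=pd.MultiIndex.from_tuples(
        [(1, 10), (1, 14), (1, 17), (2, 10), (2, 18), (2, 30)],
        names=['I', 'SI']))

# weight each deltas value by its SI index label, then sum within each I
sums = (results['deltas'] * results.index.get_level_values('SI')).groupby(level='I').sum()
print(sums)
```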

How to 'unravel' a dataframe using the values of the rows?

I have a DataFrame that looks like this:
class passed failed extra_teaching
A11 1 2 0.5
A12 2 1 0.7
I want to 'unravel' the DataFrame, and lose the information about the class but keep the information on extra_teaching, so I end up with a row for each individual pupil who passed.
So the DataFrame should end up looking like this:
pass extra_teaching
1 0.5
0 0.5
0 0.5
1 0.7
1 0.7
0 0.7
I have no idea how to do this in pandas, except perhaps by using iterrows() and manually appending rows to a new DataFrame - has anyone got a neater way?
UPDATE:
I tried this; it seems to work, though it is not very elegant (note that each appended dict must be a copy, otherwise every row references the same mutating dict):
temp = []
df = df.set_index('class')
for idx in df.index:
    row = df.loc[idx]
    base = {'class': idx, 'extra_teaching': row['extra_teaching']}
    for i in range(int(row['passed'])):
        temp.append({**base, 'pass': 1})
    for i in range(int(row['failed'])):
        temp.append({**base, 'pass': 0})
df_exploded = pd.DataFrame(temp)
Try:
def teaching_results(x):
    num_rows = x.passed.iloc[0] + x.failed.iloc[0]
    passed = x.passed.iloc[0] * [1] + x.failed.iloc[0] * [0]
    extra_teaching = num_rows * [x.extra_teaching.iloc[0]]
    class_code = x['class'].iloc[0]
    return pd.DataFrame({'pass': passed, 'extra_teaching': extra_teaching, 'class': class_code})

df.groupby('class', as_index=False).apply(teaching_results)
to get:
class extra_teaching pass
0 0 A11 0.5 1
1 A11 0.5 0
2 A11 0.5 0
1 0 A12 0.7 1
1 A12 0.7 1
2 A12 0.7 0
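An alternative sketch without a per-group apply, using Index.repeat (the same technique as in the "Creation Variations" answer above; the input frame is rebuilt from the question's example):

```python
import pandas as pd

df = pd.DataFrame({'class': ['A11', 'A12'],
                   'passed': [1, 2],
                   'failed': [2, 1],
                   'extra_teaching': [0.5, 0.7]})

# one output row per pupil: repeat each class row passed + failed times
total = df['passed'] + df['failed']
out = df.loc[df.index.repeat(total), ['class', 'extra_teaching']].reset_index(drop=True)

# within each class, the first `passed` rows get pass=1, the rest pass=0
rank = out.groupby('class').cumcount()
out['pass'] = (rank < out['class'].map(df.set_index('class')['passed'])).astype(int)
print(out[['pass', 'extra_teaching']])
```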
