I am working with a dataframe and had to do a groupby in order to perform some operations on my data.
This is an example of my dataframe:
I SI deltas
1 10 0.1
1 14 0.1
2 10 0.1
2 18 0.3
1 17 0.05
2 30 0.3
1 10 0.4
1 14 0.2
2 10 0.1
2 18 0.2
1 17 0.15
Now, for each I, I count the relative frequency of the SI in this way:
results = df.groupby(['I', 'SI'])[['deltas']].sum()
#for each I, we sum all the weights (Deltas)
denom = results.groupby('I')['deltas'].sum()
#for each I, we divide each deltas by the sum, getting them normalized to one
results.deltas = results.deltas / denom
So my Dataframe now looks like this:
       deltas
I SI
1 10      0.5
  14      0.3
  17      0.2
2 10      0.2
  18      0.5
  30      0.3
....
What I need to do is to print for each I the sum of deltas times their relative SI:
I = 1 sum = 0.5*10 + 0.3*14 + 0.2*17 = 12.6
I = 2 sum = 0.2*10 + 0.5*18 + 0.3*30 = 20
But since I am now working with a dataframe whose indices are I and SI, I do not know how to use them. I tried this code:
for idx2, j in enumerate(results.index.get_level_values(0).unique()):
    #print results.loc[j]
    f.write("%d\t" % (j) + results.loc[j].to_string(index=False) + '\n')
but I am not sure how I should proceed to get the index values.
Let's assume you have an input dataframe df following your initial transformations. If SI is in your index, promote it to a column with df = df.reset_index() as a first step.
I SI weight
0 1 10 0.5
1 1 14 0.3
2 1 17 0.2
3 2 10 0.2
4 2 18 0.5
5 2 30 0.3
You can then calculate the product of SI and weight, then use GroupBy + sum:
res = df.assign(prod=df['SI']*df['weight'])\
.groupby('I')['prod'].sum().reset_index()
print(res)
I prod
0 1 12.6
1 2 20.0
For a single group in isolation, you can use np.dot for the dot product:
s = pd.Series([0.5, 0.3, 0.2], index=[10, 14, 17])
s.index.name = 'SI'
res = np.dot(s.index, s) # 12.6
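If you'd rather keep the (I, SI) MultiIndex on results instead of resetting it, here is a minimal sketch of the same computation that reads SI from the index level (assuming results is the normalized frame built in the question):
import numpy as np

weighted = results.groupby(level='I').apply(
    lambda g: np.dot(g.index.get_level_values('SI'), g['deltas']))
print(weighted)  # I = 1 -> 12.6, I = 2 -> 20.0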
Suppose I have the following dataframe:
data = {"age":[2,3,2,5,9,12,20,43,55,60],'alpha' : [0,0,0,0,0,0,0,0,0,0]}
df = pd.DataFrame(data)
I want to change the value of column alpha based on column age using df.loc and an arithmetic sequence, but I got a syntax error:
df.loc[((df.age <=4)) , "alpha"] = ".4"
df.loc[((df.age >= 5)) & ((df.age <= 20)), "alpha"] = 0.4 + (1 - 0.4)*((df$age - 4)/(20 - 4))
df.loc[((df.age > 20)) , "alpha"] = "1"
Thank you in advance.
Reference the age column using a ., not a $:
df.loc[((df.age >= 5)) & ((df.age <= 20)), "alpha"] = 0.4 + (1 - 0.4)*((df.age - 4)/(20 - 4))
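For completeness, here is a sketch of all three assignments with the $ fixed and the quoted ".4"/"1" also swapped for floats, so the alpha column keeps a single numeric dtype (mixing strings and floats forces it to object):
import pandas as pd

data = {"age": [2, 3, 2, 5, 9, 12, 20, 43, 55, 60], "alpha": [0.0] * 10}
df = pd.DataFrame(data)

df.loc[df.age <= 4, "alpha"] = 0.4
df.loc[(df.age >= 5) & (df.age <= 20), "alpha"] = 0.4 + (1 - 0.4) * ((df.age - 4) / (20 - 4))
df.loc[df.age > 20, "alpha"] = 1.0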
Instead of multiple .loc assignments you can combine all conditions at once using chained np.where clauses:
df['alpha'] = np.where(df.age <= 4, ".4",
                       np.where((df.age >= 5) & (df.age <= 20),
                                0.4 + (1 - 0.4) * ((df.age - 4) / (20 - 4)),
                                np.where(df.age > 20, "1", df.alpha)))
print(df)
age alpha
0 2 .4
1 3 .4
2 2 .4
3 5 0.4375
4 9 0.5875
5 12 0.7
6 20 1.0
7 43 1
8 55 1
9 60 1
Besides the syntax error (due to the $), to reduce visual noise I would go for numpy.select:
import numpy as np
conditions = [df["age"].le(4),
              df["age"].gt(4) & df["age"].le(20),
              df["age"].gt(20)]
values = [".4", 0.4 + (1 - 0.4) * ((df["age"] - 4) / (20 - 4)), 1]
df["alpha"] = np.select(condlist=conditions, choicelist=values)
Output:
print(df)
age alpha
0 2 .4
1 3 .4
2 2 .4
3 5 0.4375
4 9 0.5875
5 12 0.7
6 20 1.0
7 43 1
8 55 1
9 60 1
I know that for loops are not a good thing in Pandas, and apply could be better. But I found it hard to use apply for my question.
data = {'A':[1,1,1,2,2], 'B':[2018,2019,2020,2019,2020],'PR':[12,10,0,24,20],'WP':[300,0,0,300,0],'BD':[6,5,0,2,1],'i':[1,2,1,1,2],'r':[0.5,0.25,0,0.5,0.25]}
df = pd.DataFrame(data)
df['X'] = 0
df['Y'] = 0
df['Z'] = 0
The original dataframe:
   A     B  PR   WP  BD  i     r  X  Y  Z
0  1  2018  12  300   6  1  0.50  0  0  0
1  1  2019  10    0   5  2  0.25  0  0  0
2  1  2020   0    0   0  1  0.00  0  0  0
3  2  2019  24  300   2  1  0.50  0  0  0
4  2  2020  20    0   1  2  0.25  0  0  0
My aim is:
Divide the df to two groups, according to A.
For each group, calculate the X Y and Z
X = (last year's Z + this year's PR) * this year's i
Y = last year's Z + (this year's WP - last year's WP) + (this year's BD - last year's BD) + this year's X
Z = this year's Y * this year's r
The following is my code, it works well. But I don't want to use for loop. Are there any better methods?
# divide the df into two groups
sub_df = [df[df['A'].isin([i])] for i in np.unique(df['A'])]
a = []
for df in sub_df:
    df = df.copy()
    df.loc[-1] = [0] * df.shape[1]  # add a zero row to serve as the first year's "previous year"
    df.sort_index(inplace=True)
    df.reset_index(drop=True, inplace=True)
    for n in range(1, df.shape[0]):
        df.loc[n, 'X'] = (df.loc[n-1, 'Z'] + df.loc[n, 'PR']) * df.loc[n, 'i']
        df.loc[n, 'Y'] = df.loc[n-1, 'Z'] + df.loc[n, 'WP'] - df.loc[n-1, 'WP'] + df.loc[n, 'BD'] - df.loc[n-1, 'BD'] + df.loc[n, 'X']
        df.loc[n, 'Z'] = df.loc[n, 'Y'] * df.loc[n, 'r']
    a.append(df[1:])
b = pd.concat(a)
b
I don't know of any method that avoids the for loop entirely, but I managed to simplify the code a little bit:
def get_df(df):
    X, Y, Z = [], [], []
    prev_Z, prev_WP, prev_BD = 0, 0, 0
    # single pass over the rows, carrying the previous year's values along
    for pr, i, wp, bd, r in zip(df["PR"], df["i"], df["WP"], df["BD"], df["r"]):
        X.append((prev_Z + pr) * i)
        Y.append(prev_Z + wp - prev_WP + bd - prev_BD + X[-1])
        Z.append(Y[-1] * r)
        prev_Z, prev_WP, prev_BD = Z[-1], wp, bd
    df["X"] = X
    df["Y"] = Y
    df["Z"] = Z
    return df
out = df.groupby("A").apply(get_df)
print(out)
Prints:
A B PR WP BD i r X Y Z
0 1 2018 12 300 6 1 0.50 12.0 318.0 159.0
1 1 2019 10 0 5 2 0.25 338.0 196.0 49.0
2 1 2020 0 0 0 1 0.00 49.0 93.0 0.0
3 2 2019 24 300 2 1 0.50 24.0 326.0 163.0
4 2 2020 20 0 1 2 0.25 366.0 228.0 57.0
According to my timing, your code takes about 0.0048 seconds to complete and my version about 0.0019 seconds, so it is roughly 2.5x faster.
I have a dataframe A that looks like this
value Frequency
0.1 3
0.2 2
and I want to convert it to dataframe B like below
Sample
0.1
0.1
0.1
0.2
0.2
Simply put, dataframe A is the samples and their frequency (repetition). Dataframe B is literally expanding that. Is there a straightforward way to do this?
What I did (a minimal working example reproducing the above):
X = pd.DataFrame([(0.1,3),(0.2,2)],columns=['value','Frequency'])
Sample = list()
for index, row in X.iterrows():
    Value = row['value']
    Freq = int(row['Frequency'])
    Sample = Sample + [Value] * Freq
Data = pd.DataFrame({'Sample': pd.Series(Sample)})
You can use Series.repeat, where the repeats argument can also be a series of ints:
df.value.repeat(df.Frequency).reset_index(drop=True).to_frame('Sample')
Sample
0 0.1
1 0.1
2 0.1
3 0.2
4 0.2
Use repeat
>>> df['value'].repeat(df.Frequency)
0 0.1
0 0.1
0 0.1
1 0.2
1 0.2
Name: value, dtype: float64
Or create a new dataframe with:
>>> pd.DataFrame(df['value'].repeat(df.Frequency).to_numpy(),columns=["Sample"])
Sample
0 0.1
1 0.1
2 0.1
3 0.2
4 0.2
You can use reindex + repeat
X = X.reindex(X.index.repeat(X.Frequency))
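A NumPy-level variant of the same idea, in case you want to skip pandas indexing altogether (a sketch; np.repeat accepts an array of per-element repeat counts):
import numpy as np
import pandas as pd

X = pd.DataFrame([(0.1, 3), (0.2, 2)], columns=['value', 'Frequency'])
Data = pd.DataFrame({'Sample': np.repeat(X['value'].to_numpy(), X['Frequency'])})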
For each "acat" unique value, I want to count how many occurrences there are of each "data" category (call this "bins"), and then calc the mean and skew of "bins"
possible values of data = 1,2,3,4,5
df = pd.DataFrame({'acat':[1,1,2,3,1,3],
'data':[1,1,2,1,3,1]})
df
Out[45]:
acat data
0 1 1
1 1 1
2 2 2
3 3 1
4 1 3
5 3 1
for acat = 1:
bins = (2 + 0 + 1 + 0 + 0)
average = bins / 5 = 0.6
for acat = 2:
bins = (0 + 1 + 0 + 0 + 0)
average = bins / 5 = 0.2
for acat = 3:
bins = (2 + 0 + 0 + 0 + 0)
average = bins / 5 = 0.4
bin_average_col
0.6
0.6
0.2
0.4
0.6
0.4
Also I would like a bin_skew_col.
I have a solution that uses crosstab, but this blows my PC memory when the number of acat is large.
I have tried extensively with groupby and transform but this is beyond me!
Many thanks in advance.
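A sketch of one approach that avoids materializing a full crosstab, computing the bins per group with value_counts (this assumes, as stated, that the possible data values 1..5 are known up front):
import pandas as pd

df = pd.DataFrame({'acat': [1, 1, 2, 3, 1, 3],
                   'data': [1, 1, 2, 1, 3, 1]})
categories = [1, 2, 3, 4, 5]  # the possible values of data

def bin_stats(s):
    # counts of each possible data value within one acat group (the "bins")
    bins = s.value_counts().reindex(categories, fill_value=0)
    return pd.Series({'bin_average': bins.mean(), 'bin_skew': bins.skew()})

stats = df.groupby('acat')['data'].apply(bin_stats).unstack()
df = df.join(stats, on='acat')
Note that Series.skew is pandas' bias-corrected sample skew; if you need a different definition, scipy.stats.skew is the usual alternative.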
I was wondering if there is a better way to get the probabilities from a 2D numpy array, maybe using some of numpy's built-in functions.
For simplicity, say we have an example array:
[['apple','pie'],
['apple','juice'],
['orange','pie'],
['strawberry','cream'],
['strawberry','candy']]
Would like to get the probability such as:
['apple' 'juice'] --> 0.4 * 0.5 = 0.2
['apple' 'pie'] --> 0.4 * 0.5 = 0.2
['orange' 'pie'] --> 0.2 * 1.0 = 0.2
['strawberry' 'candy'] --> 0.4 * 0.5 = 0.2
['strawberry' 'cream'] --> 0.4 * 0.5 = 0.2
Where 'juice' as the second word has a probability of 0.2: 'apple' has probability 2/5, and 'juice' has probability 1/2 given 'apple'.
On the other hand, 'pie' as a second word has a probability of 0.4, the sum of the contributions from 'apple' and 'orange'.
The way I approached the problem was adding 3 new columns to the array, for the probability of 1st column, 2nd column, and the final probability. Group the array per 1st column, then per 2nd column and update the probability accordingly.
Below is my code:
a = np.array([['apple','pie'], ['apple','juice'], ['orange','pie'],
              ['strawberry','cream'], ['strawberry','candy']])
ans = []
unique, counts = np.unique(a.T[0], return_counts=True)  # transpose a, and get unique first words
myCounter = zip(unique, counts)
num_rows = sum(counts)
a = np.c_[a, np.zeros(num_rows), np.zeros(num_rows), np.zeros(num_rows)]  # add 3 columns to a
groups = []
# gather groups based on column 0
for _unique, _count in myCounter:
    index = a[:, 0] == _unique  # where column 0 matches _unique
    curr_a = a[index]
    for j in range(len(curr_a)):
        curr_a[j][2] = _count / num_rows
    groups.append(curr_a)
# gather uniqueness from column 1, per group
for g in groups:
    unique, counts = np.unique(g.T[1], return_counts=True)
    myCounter = zip(unique, counts)
    num_rows = sum(counts)
    for _unique, _count in myCounter:
        index = g[:, 1] == _unique
        curr_g = g[index]
        for j in range(len(curr_g)):
            curr_g[j][3] = _count / num_rows
            curr_g[j][4] = float(curr_g[j][2]) * float(curr_g[j][3])  # compute final probability
            ans.append(curr_g[j])
for an in ans:
    print(an)
Outputs:
['apple' 'juice' '0.4' '0.5' '0.2']
['apple' 'pie' '0.4' '0.5' '0.2']
['orange' 'pie' '0.2' '1.0' '0.2']
['strawberry' 'candy' '0.4' '0.5' '0.2']
['strawberry' 'cream' '0.4' '0.5' '0.2']
I was wondering if there is a shorter/faster way of doing this using numpy or other means. Adding columns is not necessary; that was just my way of doing it. Other approaches are acceptable.
Based on the definition of probability distribution you have given, you can use pandas to do the same, i.e.:
import pandas as pd
a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
df = pd.DataFrame(a)
# Find the frequency of first word and divide by the total number of rows
df[2]=df[0].map(df[0].value_counts())/df.shape[0]
# Divide 1 by the repetition count of the first word
df[3]=1/(df[0].map(df[0].value_counts()))
# Multiply the probabilities
df[4]= df[2]*df[3]
Output:
0 1 2 3 4
0 apple pie 0.4 0.5 0.2
1 apple juice 0.4 0.5 0.2
2 orange pie 0.2 1.0 0.2
3 strawberry cream 0.4 0.5 0.2
4 strawberry candy 0.4 0.5 0.2
If you want that in the form of a list, you can use df.values.tolist().
If you don't want the intermediate columns:
df = pd.DataFrame(a)
df[2]=((df[0].map(df[0].value_counts())/df.shape[0]) * (1/(df[0].map(df[0].value_counts()))))
Output:
0 1 2
0 apple pie 0.2
1 apple juice 0.2
2 orange pie 0.2
3 strawberry cream 0.2
4 strawberry candy 0.2
For the combined probability of each second word, print(df.groupby(1)[2].sum()):
candy 0.2
cream 0.2
juice 0.2
pie 0.4
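If you want to stay in pure NumPy, here is a sketch of the same computation built on np.unique with return_inverse and return_counts; note that the per-row product reduces to the pair count divided by the number of rows:
import numpy as np

a = np.array([['apple', 'pie'], ['apple', 'juice'], ['orange', 'pie'],
              ['strawberry', 'cream'], ['strawberry', 'candy']])
n = len(a)

# P(first word): count of each first word over the total number of rows
w1, inv1, c1 = np.unique(a[:, 0], return_inverse=True, return_counts=True)
p_first = c1[inv1] / n

# P(second | first): count of the (first, second) pair over the first-word count
pairs, invp, cp = np.unique(a, axis=0, return_inverse=True, return_counts=True)
p_cond = cp[invp] / c1[inv1]

p = p_first * p_cond  # array([0.2, 0.2, 0.2, 0.2, 0.2])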