Calculating percentage in Python Pandas library

I have a Pandas dataframe like this:
import pandas as pd
df = pd.DataFrame(
    {'gender': ['F','F','F','F','F','M','M','M','M','M'],
     'mature': [0,1,0,0,0,1,1,1,0,1],
     'cta':    [1,1,0,1,0,0,0,1,0,1]}
)
df['gender'] = df['gender'].astype('category')
df['mature'] = df['mature'].astype('category')
df['cta'] = pd.to_numeric(df['cta'])
df
I calculated the sum (how many times people clicked) and the total (the number of messages sent). I want to figure out how to calculate the percentage, defined as clicks/total, and how to get a DataFrame as output.
temp_groupby = df.groupby('gender').agg({'cta': [('clicks','sum'),
                                                 ('total','count')]})
temp_groupby

I think you need the mean here; add a new tuple to the list like:
temp_groupby = df.groupby('gender').agg({'cta': [('clicks','sum'),
                                                 ('total','count'),
                                                 ('perc', 'mean')]})
print (temp_groupby)
          cta
       clicks total perc
gender
F           3     5  0.6
M           2     5  0.4
To avoid a MultiIndex in the columns, specify the column after groupby:
temp_groupby = df.groupby('gender')['cta'].agg([('clicks','sum'),
                                                ('total','count'),
                                                ('perc', 'mean')]).reset_index()
print (temp_groupby)
  gender  clicks  total  perc
0      F       3      5   0.6
1      M       2      5   0.4
Or use named aggregation:
temp_groupby = df.groupby('gender', as_index=False).agg(clicks=('cta','sum'),
                                                        total=('cta','count'),
                                                        perc=('cta','mean'))
print (temp_groupby)
  gender  clicks  total  perc
0      F       3      5   0.6
1      M       2      5   0.4
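
Because cta only contains 0s and 1s, its mean already equals clicks/total. If you prefer to compute the percentage explicitly from the aggregated columns (which also works for non-binary data), a minimal sketch using the df defined above:
temp_groupby = df.groupby('gender')['cta'].agg([('clicks','sum'),
                                                ('total','count')]).reset_index()
temp_groupby['perc'] = temp_groupby['clicks'] / temp_groupby['total']
print (temp_groupby)
  gender  clicks  total  perc
0      F       3      5   0.6
1      M       2      5   0.4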

Related

Average score per attempt for entries with non-fully-overlapping attempts

I have a pandas dataframe that has a column that contains a list of attempt numbers and another column that contains the score achieved on those attempts. A simplified example is below:
import pandas as pd

scores = [[0,1,0], [0,0], [0,6,2]]
attempt_num = [[1,2,3], [2,4], [2,3,4]]
df = pd.DataFrame([attempt_num, scores]).T
df.columns = ['Attempt', 'Score']
Each row represents a different person, which for the purposes of this question, we can assume are unique. The data is incomplete, and so I have attempt number 1, 2 and 3 for the first person, 2 and 4 for the second and 2, 3 and 4 for the last. What I want to do is to get an average score per attempt. For example, attempt 1 only shows up once and so the average would be 0, the score achieved when it did show up. Attempt 2 shows up for all persons which gives an average of 0.33 ((1 + 0 + 0)/3) and so on. So the expected output would be:
   Attempt_Number  Average_Score
0               1           0.00
1               2           0.33
2               3           3.00
3               4           1.00
I could loop through every row of the dataframe and then through every element of the list in that row, append the score to an ordered list and calculate the average for every element in that list, but this seems very inefficient. Is there a better way?
Use DataFrame.explode on both columns, then aggregate the mean:
df = (df.explode(['Attempt','Score'])
        .astype({'Score':int})
        .groupby('Attempt', as_index=False)['Score']
        .mean()
        .rename(columns={'Attempt':'Attempt_Number','Score':'Average_Score'})
      )
print (df)
   Attempt_Number  Average_Score
0               1       0.000000
1               2       0.333333
2               3       3.000000
3               4       1.000000
For older pandas versions (exploding multiple columns at once requires pandas 1.3+) use:
df = (df.apply(pd.Series.explode)
        .astype({'Score':int})
        .groupby('Attempt', as_index=False)['Score']
        .mean()
        .rename(columns={'Attempt':'Attempt_Number','Score':'Average_Score'})
      )
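
To see what the explode step produces before grouping, a quick sketch assuming the original df built above (pandas 1.3+ for multi-column explode):
print (df.explode(['Attempt', 'Score']))
  Attempt Score
0       1     0
0       2     1
0       3     0
1       2     0
1       4     0
2       2     0
2       3     6
2       4     2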

python pandas new column with order of values

I would like to make a new column with the order of the numbers in a list. I get 3,1,0,4,2,5 (the indices of the lowest numbers), but I would like a new column with 2,1,4,0,3,5, so that if I look at a row I get the position its number takes in the sorted order of the whole list. What am I doing wrong?
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df.sort_values(by='list').index
print(df)
What you're looking for is the rank: your code computes the argsort (which row holds each sorted value), while rank gives each value its position in the sorted order:
import pandas as pd
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df['list'].rank().sub(1).astype(int)
Result:
   list  order
0     4      2
1     3      1
2     6      4
3     1      0
4     5      3
5     9      5
You can use the method parameter to control how to resolve ties.
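
For illustration, a small sketch of how the method parameter changes tie handling (the values here are just an example):
import pandas as pd

s = pd.Series([4, 3, 3, 1])
print (s.rank(method='average'))  # default: tied values share the average of their ranks
print (s.rank(method='min'))      # tied values all get the lowest rank in the tie
print (s.rank(method='first'))    # ties broken by order of appearance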

Get the percentile of a column ordered by another column

I have a dataframe with two columns, score and order_amount (named value in the example below). I want to find the score Y that represents the Xth percentile of order_amount. I.e. if I sum up all of the values of order_amount where score <= Y, I will get X% of the total order_amount.
I have a solution below that works, but it seems like there should be a more elegant way with pandas.
import pandas as pd
test_data = {'score': [0.3, 0.1, 0.2, 0.4, 0.8],
             'value': [10, 100, 15, 200, 150]}
df = pd.DataFrame(test_data)
df
   score  value
0    0.3     10
1    0.1    100
2    0.2     15
3    0.4    200
4    0.8    150
# Now we can order by `score` and use `cumsum` to calculate what we want
df_order = df.sort_values('score')
df_order['percentile_value'] = 100*df_order['value'].cumsum()/df_order['value'].sum()
df_order
   score  value  percentile_value
1    0.1    100         21.052632
2    0.2     15         24.210526
0    0.3     10         26.315789
3    0.4    200         68.421053
4    0.8    150        100.000000
# Now we can find the first score whose percentile value is bigger than 50% (for example)
df_order[df_order['percentile_value']>50]['score'].iloc[0]
Use Series.searchsorted (this works because percentile_value is a cumulative sum of non-negative values and therefore monotonically increasing):
idx = df_order['percentile_value'].searchsorted(50)
print (df_order.iloc[idx, df.columns.get_loc('score')])
0.4
Or get the first value of the filtered Series with next and iter; if there is no match, the given default value is returned:
s = df_order.loc[df_order['percentile_value'] > 50, 'score']
print (next(iter(s), 'no match'))
0.4
One line solution:
out = next(iter((df.sort_values('score')
                   .assign(percentile_value = lambda x: 100*x['value'].cumsum()/x['value'].sum())
                   .query('percentile_value > 50')['score'])), 'no match')
print (out)
0.4
Here is another way, starting from the original dataframe, using np.percentile:
import numpy as np

df = df.sort_values('score')
df.loc[np.searchsorted(df['value'],np.percentile(df['value'].cumsum(),50)),'score']
Or Series.quantile:
df.loc[np.searchsorted(df['value'],df['value'].cumsum().quantile(0.5)),'score']
Or similarly with iloc, if the index is not the default:
df.iloc[np.searchsorted(df['value'],
                        np.percentile(df['value'].cumsum(),50)), df.columns.get_loc('score')]
0.4
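
A compact variant of the same cumulative-share idea, as a sketch assuming the df and the 50% target from above (idxmax returns the label of the first row whose cumulative share exceeds the target):
target = 50
df_sorted = df.sort_values('score')
share = 100 * df_sorted['value'].cumsum() / df_sorted['value'].sum()
print (df_sorted.loc[share.gt(target).idxmax(), 'score'])
0.4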

Unpack DataFrame with tuple entries into separate DataFrames

I wrote a small class to compute some statistics through bootstrap without replacement. For those not familiar with this technique, you get n random subsamples of some data, compute the desired statistic (let's say the median) on each subsample, and then compare the values across subsamples. This allows you to get a measure of the variance of the obtained median over the dataset.
I implemented this in a class but reduced it to an MWE given by the following function:
import numpy as np
import pandas as pd

def bootstrap_median(df, n=5000, fraction=0.1):
    if isinstance(df, pd.DataFrame):
        columns = df.columns
    else:
        columns = None
    # Get the values as a ndarray
    arr = np.array(df.values)
    # Get the bootstrap sample through random permutations
    sample_len = int(len(arr)*fraction)
    if sample_len < 1:
        sample_len = 1
    sample = []
    for n_sample in range(n):
        sample.append(arr[np.random.permutation(len(arr))[:sample_len]])
    sample = np.array(sample)
    # Compute the median on each sample
    temp = np.median(sample, axis=1)
    # Get the mean and std of the estimate across samples
    m = np.mean(temp, axis=0)
    s = np.std(temp, axis=0)/np.sqrt(len(sample))
    # Convert output to DataFrames if necessary and return
    if columns is not None:
        m = pd.DataFrame(data=m[None, ...], columns=columns)
        s = pd.DataFrame(data=s[None, ...], columns=columns)
    return m, s
This function returns the mean and standard deviation across the medians computed on each bootstrap sample.
Now consider this example DataFrame
data = np.arange(20)
group = np.tile(np.array([1, 2]).reshape(-1,1), (1,10)).flatten()
df = pd.DataFrame.from_dict({'data': data, 'group': group})
print(df)
print(bootstrap_median(df['data']))
this prints
    data  group
0      0      1
1      1      1
2      2      1
3      3      1
4      4      1
5      5      1
6      6      1
7      7      1
8      8      1
9      9      1
10    10      2
11    11      2
12    12      2
13    13      2
14    14      2
15    15      2
16    16      2
17    17      2
18    18      2
19    19      2
(9.5161999999999995, 0.056585753613431718)
So far so good, because bootstrap_median returns a tuple of two elements. However, if I do this after a groupby:
In: df.groupby('group')['data'].apply(bootstrap_median)
Out:
group
1     (4.5356, 0.0409710449952)
2    (14.5006, 0.0403772204095)
The values inside each cell are tuples, as one would expect from apply. I can unpack the result into two DataFrames by iterating over the elements like this:
out = df.groupby('group')['data'].apply(bootstrap_median)
index = []
data1 = []
data2 = []
for g, (m, s) in out.items():
    index.append(g)
    data1.append(m)
    data2.append(s)
dfm = pd.DataFrame(data=data1, index=index, columns=['E[median]'])
dfm.index.name = 'group'
dfs = pd.DataFrame(data=data2, index=index, columns=['std[median]'])
dfs.index.name = 'group'
thus
In: dfm
Out:
       E[median]
group
1         4.5356
2        14.5006
In: dfs
Out:
           std[median]
group
1      0.0409710449952
2      0.0403772204095
This is a bit cumbersome, and my question is whether there is a more pandas-native way to "unpack" a dataframe whose values are tuples into separate DataFrames.
This question seemed related, but it concerned string regex replacements, not unpacking true tuples.
I think you need to change:
return m, s
to:
return pd.Series([m, s], index=['m','s'])
And then you get:
df1 = df.groupby('group')['data'].apply(bootstrap_median)
print (df1)
group
1      m     4.480400
       s     0.040542
2      m    14.565200
       s     0.040373
Name: data, dtype: float64
Then it is possible to select by xs:
print (df1.xs('s', level=1))
group
1    0.040542
2    0.040373
Name: data, dtype: float64
print (df1.xs('m', level=1))
group
1     4.4804
2    14.5652
Name: data, dtype: float64
Also, if you need a one-column DataFrame, add to_frame:
df1 = df.groupby('group')['data'].apply(bootstrap_median).to_frame()
print (df1)
               data
group
1     m    4.476800
      s    0.041100
2     m   14.468400
      s    0.040719
print (df1.xs('s', level=1))
           data
group
1      0.041100
2      0.040719
print (df1.xs('m', level=1))
          data
group
1      4.4768
2     14.4684
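
If you would rather have both statistics side by side as columns instead of a second index level, a small sketch: unstack the inner level of the Series returned above:
df2 = df.groupby('group')['data'].apply(bootstrap_median).unstack()
# df2 now has one row per group and the columns 'm' and 's'
# (the exact values vary between runs because the bootstrap is random)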

Setting the content of a pandas DataFrame cell based on the values of other columns cells

I have a pandas DataFrame df with the following content:
Serial N  voltage  current  average
B         10       2
B         10       2
C         12       0.7
D         40       0.5
. . .
AB        10       3
AB        10       3
I would like the "average" column to contain the average of the current column over all rows that share the same voltage; otherwise it should keep the same value as current. For example, I would like my dataFrame to look something like this:
Serial N  voltage  current  average
B         10       2        2.5
B         10       2        2.5
C         12       0.7      0.7
D         40       0.5      0.5
. . .
AB        10       3        2.5
AB        10       3        2.5
The Serial N values B and AB have the same voltage; therefore their average column contains the average over all rows with that voltage. How can I tackle this problem, without using a loop if possible?
You can use the pandas groupby function to get the averages and then merge the result back into the data frame. Have a look at the result of each line to see what it does.
averages = df.groupby('voltage')[['current']].mean()
# rename the column so it's obvious what it is
averages.columns = ['average']
averages = averages.reset_index()
df = df.merge(averages, how='left', on='voltage')
Have a look at the documentation on grouping; it should give you some hints for problems like this.
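
If you only need the averages as a new column and want to skip the explicit merge, a minimal alternative sketch using groupby with transform, which returns a result aligned with the original rows:
df['average'] = df.groupby('voltage')['current'].transform('mean')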
