pandas groupby two columns and summarize by mean - python

I have a data frame like this:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['id'] = [1,1,1,2,2,3,3,3,3,4,4,5]
df['view'] = ['A', 'B', 'A', 'A','B', 'A', 'B', 'A', 'A','B', 'A', 'B']
df['value'] = np.random.random(12)
id view value
0 1 A 0.625781
1 1 B 0.330084
2 1 A 0.024532
3 2 A 0.154651
4 2 B 0.196960
5 3 A 0.393941
6 3 B 0.607217
7 3 A 0.422823
8 3 A 0.994323
9 4 B 0.366650
10 4 A 0.649585
11 5 B 0.513923
I now want to summarize, for each id and each view, the mean of 'value'.
Think of this as some ids have repeated observations for view, and I want to summarize them. For example, id 1 has two observations for A.
I tried
res = df.groupby(['id', 'view'])['value'].mean()
This is actually almost what I want, but pandas combines the id and view columns into a single index, which I do not want.
id view
1 A 0.325157
B 0.330084
2 A 0.154651
B 0.196960
3 A 0.603696
B 0.607217
4 A 0.649585
B 0.366650
5 B 0.513923
Also, res.shape is (9,), i.e. the result is a Series rather than a DataFrame.
my desired output would be this:
id view value
1 A 0.325157
1 B 0.330084
2 A 0.154651
2 B 0.196960
3 A 0.603696
3 B 0.607217
4 A 0.649585
4 B 0.366650
5 B 0.513923
where the column names and DataFrame shape are kept and the id is repeated on every row. Each id should have at most one row for A and one for B.
How can I achieve this?

You need reset_index, or the parameter as_index=False in groupby, because you get a MultiIndex, and by default the higher levels of the index are sparsified to make the console output a bit easier on the eyes:
np.random.seed(100)
df = pd.DataFrame()
df['id'] = [1,1,1,2,2,3,3,3,3,4,4,5]
df['view'] = ['A', 'B', 'A', 'A','B', 'A', 'B', 'A', 'A','B', 'A', 'B']
df['value'] = np.random.random(12)
print (df)
id view value
0 1 A 0.543405
1 1 B 0.278369
2 1 A 0.424518
3 2 A 0.844776
4 2 B 0.004719
5 3 A 0.121569
6 3 B 0.670749
7 3 A 0.825853
8 3 A 0.136707
9 4 B 0.575093
10 4 A 0.891322
11 5 B 0.209202
res = df.groupby(['id', 'view'])['value'].mean().reset_index()
print (res)
id view value
0 1 A 0.483961
1 1 B 0.278369
2 2 A 0.844776
3 2 B 0.004719
4 3 A 0.361376
5 3 B 0.670749
6 4 A 0.891322
7 4 B 0.575093
8 5 B 0.209202
res = df.groupby(['id', 'view'], as_index=False)['value'].mean()
print (res)
id view value
0 1 A 0.483961
1 1 B 0.278369
2 2 A 0.844776
3 2 B 0.004719
4 3 A 0.361376
5 3 B 0.670749
6 4 A 0.891322
7 4 B 0.575093
8 5 B 0.209202
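As an aside, the sparsification mentioned above is purely a display setting; the underlying MultiIndex still carries the id on every row. A small sketch, using the pandas display option multi_sparse, that prints every index level repeated without reshaping anything:
# display-only: show the repeated index labels instead of blanks
with pd.option_context('display.multi_sparse', False):
    print (df.groupby(['id', 'view'])['value'].mean())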


Create multiple columns with Pandas .apply()

I have two pandas DataFrames, both containing the same categories but different 'id' columns. In order to illustrate, the first table looks like this:
df = pd.DataFrame({
    'id': list(np.arange(1, 12)),
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c'],
    'weight': list(np.random.randint(1, 5, 11))
})
df['weight_sum'] = df.groupby('category')['weight'].transform('sum')
df['p'] = df['weight'] / df['weight_sum']
Output:
id category weight weight_sum p
0 1 a 4 14 0.285714
1 2 a 4 14 0.285714
2 3 a 2 14 0.142857
3 4 a 4 14 0.285714
4 5 b 4 8 0.500000
5 6 b 4 8 0.500000
6 7 c 3 15 0.200000
7 8 c 4 15 0.266667
8 9 c 2 15 0.133333
9 10 c 4 15 0.266667
10 11 c 2 15 0.133333
The second contains only 'id' and 'category'.
What I'm trying to do is create a third DataFrame that would inherit the id column of the second DataFrame, plus three new columns holding ids from the first DataFrame, each selected based on the p column, which represents an id's weight within its category.
I've tried multiple solutions and was thinking of applying np.random.choice and .apply(), but couldn't figure out a way to make that work.
EDIT:
The desired output would be something like:
user_id id_1 id_2 id_3
0 2 3 1 2
1 3 2 2 3
2 4 1 3 1
With each id being selected based on its probability and respective category (both DataFrames have this column), and the same id not showing up more than once for the same user_id.
IIUC, you want to select random IDs of the same category with weighted probabilities. For this you can construct a helper dataframe (dfg) and use apply:
df2 = pd.DataFrame({
    'id': np.random.randint(1, 12, size=11),
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c']})
dfg = df.groupby('category').agg(list)
df3 = df2.join(df2['category']
                 .apply(lambda r: pd.Series(np.random.choice(dfg.loc[r, 'id'],
                                                             p=dfg.loc[r, 'p'],
                                                             size=3)))
                 .add_prefix('id_')
               )
Output:
id category id_0 id_1 id_2
0 11 a 2 3 3
1 10 a 2 3 1
2 4 a 1 2 3
3 7 a 2 1 4
4 5 b 6 5 5
5 10 b 6 5 6
6 8 c 9 8 8
7 11 c 7 8 7
8 11 c 10 8 8
9 4 c 9 10 10
10 1 c 11 11 9
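Note that np.random.choice samples with replacement by default, which is why the same id can repeat within a row above (rows 0 and 4, for example). If each id may appear at most once per user, as the question asks, a sketch of the same join with replace=False follows; the caveat is that every category then needs at least size distinct ids (category 'b' has only two here, so size=3 would raise a ValueError for it):
df3 = df2.join(df2['category']
                 .apply(lambda r: pd.Series(np.random.choice(dfg.loc[r, 'id'],
                                                             p=dfg.loc[r, 'p'],
                                                             size=3,
                                                             replace=False)))  # no repeats within a row
                 .add_prefix('id_')
               )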

How to remove NaNs and squeeze in a DataFrame - pandas

I was doing some coding and realized something: I think there is an easier way of doing this.
So I have a DataFrame like this:
>>> df = pd.DataFrame({'a': [1, 'A', 2, 'A'], 'b': ['A', 3, 'A', 4]})
a b
0 1 A
1 A 3
2 2 A
3 A 4
And I want to remove all of the As from the data, but I also want to squeeze the DataFrame in. What I mean by squeezing is getting a result like this:
a b
0 1 3
1 2 4
I have a solution as follows:
a = df['a'][df['a'] != 'A']
b = df['b'][df['b'] != 'A']
df2 = pd.DataFrame({'a': a.tolist(), 'b': b.tolist()})
print(df2)
This works, but I can't help thinking there is an easier way; I've stopped coding for a while, so I'm not so sharp anymore...
Note:
All columns have the same number of As, so there is no problem there.
You can try boolean indexing with loc to remove the A values:
pd.DataFrame({c: df.loc[df[c] != 'A', c].tolist() for c in df})
Result:
a b
0 1 3
1 2 4
This would do:
In [1513]: df.replace('A', np.nan).apply(lambda x: pd.Series(x.dropna().to_numpy()))
Out[1513]:
a b
0 1.0 3.0
1 2.0 4.0
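Note that replacing 'A' with np.nan upcasts the surviving values to float, which is where the 1.0 and 3.0 come from; if you want integers back you can chain .astype(int):
df.replace('A', np.nan).apply(lambda x: pd.Series(x.dropna().to_numpy())).astype(int)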
We can use df.melt, then filter out the 'A' values, then df.pivot:
out = df.melt().query("value!='A'")
out.index = out.groupby('variable')['variable'].cumcount()
out.pivot(columns='variable', values='value').rename_axis(columns=None)
a b
0 1 3
1 2 4
Details
out = df.melt().query("value!='A'")
variable value
0 a 1
2 a 2
5 b 3
7 b 4
# We set this as index so it helps in `df.pivot`
out.groupby('variable')['variable'].cumcount()
0 0
2 1
5 0
7 1
dtype: int64
out.pivot(columns='variable', values='value').rename_axis(columns=None)
a b
0 1 3
1 2 4
Another alternative
df = df.mask(df.eq('A'))
out = df.stack()
pd.DataFrame(out.groupby(level=1).agg(list).to_dict())
a b
0 1 3
1 2 4
Details
df = df.mask(df.eq('A'))
a b
0 1 NaN
1 NaN 3
2 2 NaN
3 NaN 4
out = df.stack()
0 a 1
1 b 3
2 a 2
3 b 4
dtype: object
pd.DataFrame(out.groupby(level=1).agg(list).to_dict())
a b
0 1 3
1 2 4
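A caveat that spans all of these answers: they rely on the OP's note that every column contains the same number of As. With unequal counts, the melt/pivot and replace/dropna variants would pad the shorter columns with NaN, while the tolist- and agg(list)-based constructions would raise, because the column lists would have different lengths.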

How to change value in columns 4,5,6 of a dataframe to percentage format?

I use the following code to try to change the values in columns 4, 5 and 6 of a dataframe to percentage format, but it raises an error.
df.iloc[:,4:7].apply('{:.2%}'.format)
The problem is that apply hands each whole column (a Series) to the function, while '{:.2%}'.format expects a single scalar. You can use DataFrame.applymap, which applies the function element-wise:
df = pd.DataFrame({
    'a': list('abcdef'),
    'b': list('aaabbb'),
    'c': [4,5,4,5,5,4],
    'd': [7,8,9,4,2,3],
    'e': [5,3,6,9,2,4],
    'f': [7,8,9,4,2,3],
    'g': [1,3,5,7,1,0],
    'h': [7,8,9,4,2,3],
    'i': [1,3,5,7,1,0]
})
df.iloc[:,4:7] = df.iloc[:,4:7].applymap('{:.2%}'.format)
print (df)
a b c d e f g h i
0 a a 4 7 500.00% 700.00% 100.00% 7 1
1 b a 5 8 300.00% 800.00% 300.00% 8 3
2 c a 4 9 600.00% 900.00% 500.00% 9 5
3 d b 5 4 900.00% 400.00% 700.00% 4 7
4 e b 5 2 200.00% 200.00% 100.00% 2 1
5 f b 4 3 400.00% 300.00% 0.00% 3 0
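A version note, if you are on a recent pandas: DataFrame.applymap was deprecated in pandas 2.1 in favor of the element-wise DataFrame.map, so on current versions the same line reads:
df.iloc[:,4:7] = df.iloc[:,4:7].map('{:.2%}'.format)   # pandas >= 2.1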

Loop over groups Pandas Dataframe and get sum/count

I am using Pandas to structure and process Data.
This is my DataFrame: [shown as an image in the original post]
And this is the code which produced it:
(data[['time_bucket', 'beginning_time', 'bitrate', 2, 3]].groupby(['time_bucket', 'beginning_time', 2, 3])).aggregate(np.mean)
Now I want the sum (ideally, the sum and the count) of my bitrates grouped by the same time_bucket. For example, for the first time_bucket (2016-07-08 02:00:00, 2016-07-08 02:05:00), it must be 93750000 as the sum and 25 as the count over the whole 'bitrate' column.
I did this :
data[['time_bucket', 'bitrate']].groupby(['time_bucket']).agg(['sum', 'count'])
And this is the result: [also an image in the original post]
But I really want to have all my data in one DataFrame.
Can I do a simple loop over 'time_bucket' and apply a function which calculates the sum of all bitrates?
Any ideas? Thanks!
I think you need merge, but both DataFrames must have the same index levels, so first use reset_index; finally, restore the original MultiIndex with set_index:
data = pd.DataFrame({'A':[1,1,1,1,1,1],
                     'B':[4,4,4,5,5,5],
                     'C':[3,3,3,1,1,1],
                     'D':[1,3,1,3,1,3],
                     'E':[5,3,6,5,7,1]})
print (data)
A B C D E
0 1 4 3 1 5
1 1 4 3 3 3
2 1 4 3 1 6
3 1 5 1 3 5
4 1 5 1 1 7
5 1 5 1 3 1
df1 = data[['A', 'B', 'C', 'D','E']].groupby(['A', 'B', 'C', 'D']).aggregate(np.mean)
print (df1)
E
A B C D
1 4 3 1 5.5
3 3.0
5 1 1 7.0
3 3.0
df2 = data[['A', 'C']].groupby(['A'])['C'].agg(['sum', 'count'])
print (df2)
sum count
A
1 12 6
print (pd.merge(df1.reset_index(['B','C','D']), df2, left_index=True, right_index=True)
         .set_index(['B','C','D'], append=True))
E sum count
A B C D
1 4 3 1 5.5 12 6
3 3.0 12 6
5 1 1 7.0 12 6
3 3.0 12 6
I tried another solution that works from df1 alone, but df1 is already aggregated, so it is impossible to recover the right numbers: if you sum the C values of df1's index, you get 8 instead of 12, because the duplicate rows have been collapsed.
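A sketch of an alternative that avoids the merge entirely: broadcast the per-A sum and count back onto the raw rows with transform, then carry them through the aggregation with 'first' (the column names 'sum' and 'count' are just illustrative):
# transform repeats the group aggregate on every row of the group
data['sum'] = data.groupby('A')['C'].transform('sum')
data['count'] = data.groupby('A')['C'].transform('count')
df3 = data.groupby(['A','B','C','D']).agg({'E':'mean', 'sum':'first', 'count':'first'})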

Add rank field to pandas dataframe by unique groups and sorting by multiple columns

Say I have this dataframe, and I want every unique user ID to have its own rank value based on the date stamp:
In [93]:
import datetime

df = pd.DataFrame({
    'userid': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'date_stamp': ['2016-02-01', '2016-02-01', '2016-02-04', '2016-02-08', '2016-02-04', '2016-02-10', '2016-02-10', '2016-02-12'],
    'tie_breaker': [1,2,3,4,1,2,3,4]})
df['date_stamp'] = df['date_stamp'].map(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"))
df['rank'] = df.groupby(['userid'])['date_stamp'].rank(ascending=True, method='min')
df
Out[93]:
date_stamp tie_breaker userid rank
0 2016-02-01 1 a 1
1 2016-02-01 2 a 1
2 2016-02-04 3 a 3
3 2016-02-08 4 a 4
4 2016-02-04 1 b 1
5 2016-02-10 2 b 2
6 2016-02-10 3 b 2
7 2016-02-12 4 b 4
So that's fine, but what if I wanted to add another field to serve as a tie-breaker when two dates are the same? I was hoping it would be as easy as:
df['rank'] = df.groupby(['userid'])[['date_stamp','tie_breaker']].rank(ascending=True, method='min')
but that doesn't work. Any ideas?
ideal output:
date_stamp tie_breaker userid rank
0 2/1/16 1 a 1
1 2/1/16 2 a 2
2 2/4/16 3 a 3
3 2/8/16 4 a 4
4 2/4/16 1 b 1
5 2/10/16 2 b 2
6 2/10/16 3 b 3
7 2/12/16 4 b 4
Edit, with real data:
Looks like the top solution here doesn't handle zeros in the tie_breaker field correctly - any idea what's going on?
df = pd.DataFrame({
    'userid': ['10010012083198581013', '10010012083198581013', '10010012083198581013', '10010012083198581013', '10010012083198581013'],
    'date_stamp': ['2015-12-26 13:24:37', '2015-11-25 11:24:13', '2015-10-25 12:13:59', '2015-02-16 22:59:58', '2015-08-17 11:43:43'],
    'tie_breaker': [460000156735858, 460000152444239, 460000147374709, 11083155016444116916, 0]})
df['date_stamp'] = df['date_stamp'].map(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
df['userid'] = df['userid'].astype(str)
df['tie_breaker'] = df['tie_breaker'].astype(str)
def myrank(g):
    return pd.DataFrame(1 + np.lexsort((g['tie_breaker'].rank(),
                                        g['date_stamp'].rank())),
                        index=g.index)
df['rank'] = df.groupby(['userid']).apply(myrank)
df.sort_values('date_stamp')
Out[101]:
date_stamp tie_breaker userid rank
3 2015-02-16 11083155016444116916 10010012083198581013 2
4 2015-08-17 0 10010012083198581013 1
2 2015-10-25 460000147374709 10010012083198581013 3
1 2015-11-25 460000152444239 10010012083198581013 5
0 2015-12-26 460000156735858 10010012083198581013 4
With this dataframe:
df = pd.DataFrame({
    'userid': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'date_stamp': ['2016-02-01', '2016-02-01', '2016-02-04', '2016-02-08',
                   '2016-02-04', '2016-02-10', '2016-02-10', '2016-02-12'],
    'tie_breaker': [1,2,3,4,1,2,3,4]})
One way to do it is:
def myrank(g):
    return pd.DataFrame(1 + np.lexsort((g['tie_breaker'].rank(),
                                        g['date_stamp'].rank())),
                        index=g.index)

df['rank'] = df.groupby(['userid']).apply(myrank)
Output:
date_stamp tie_breaker userid rank
0 2016-02-01 1 a 1
1 2016-02-01 2 a 2
2 2016-02-04 3 a 3
3 2016-02-08 4 a 4
4 2016-02-04 1 b 1
5 2016-02-10 2 b 2
6 2016-02-10 3 b 3
7 2016-02-12 4 b 4
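On the asker's edit about zeros in tie_breaker: np.lexsort does not return ranks, it returns the permutation that sorts the keys (an argsort). The toy data above happens to be pre-sorted, so the permutation coincides with the ranks and the answer looks correct; on the shuffled real data, each row instead receives the index of the row that belongs at its sorted position. Taking the argsort of that permutation turns it into ranks. A sketch of a corrected myrank under that reading (note, too, that casting tie_breaker to str makes ties break lexicographically rather than numerically):
def myrank(g):
    # np.lexsort returns the sorting permutation; argsort-ing that
    # permutation converts it into per-row ranks
    order = np.lexsort((g['tie_breaker'].rank(), g['date_stamp'].rank()))
    return pd.Series(order.argsort() + 1, index=g.index)

df['rank'] = df.groupby('userid', group_keys=False).apply(myrank)
A simpler equivalent is to sort once and number the rows within each group:
df['rank'] = df.sort_values(['date_stamp', 'tie_breaker']).groupby('userid').cumcount() + 1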
