Add column with average value grouped by column - python

I want to replace values in a dataframe column with the mean (computed without zeros) of that column, grouped by another column.
Dataframe df is like:
ID | TYPE | rate
-------------
1 | A | 0 <- Replace this
2 | B | 2
3 | C | 1
4 | A | 2
5 | C | 1
6 | C | 0 <- Replace this
7 | C | 8
8 | C | 2
9 | D | 0 <- Replace this
I need to replace the values in rate where rate = 0:
df['rate'][df['rate']==0] = ?
with average value for that TYPE.
Average(without zeros) value for every type is:
A = 2/1 = 2
B = 2/1 = 2
C = (1 + 1 + 8 + 2)/4 = 3
D = 0 (default value when there is no information for the type)
Expected result:
ID | TYPE | rate
-------------
1 | A | 2 <- Changed
2 | B | 2
3 | C | 1
4 | A | 2
5 | C | 1
6 | C | 3 <- Changed
7 | C | 8
8 | C | 2
9 | D | 0 <- Changed

You could mask the rate column in the dataframe, GroupBy the TYPE and transform with the mean, which will exclude NaNs. Then use fillna to replace the values in the masked column:
ma = df.rate.mask(df.rate.eq(0))  # zeros become NaN
# per-TYPE means ignore NaN; types with no data (like D) fall back to 0
df['rate'] = ma.fillna(ma.groupby(df.TYPE).transform('mean').fillna(0))
ID TYPE rate
0 1 A 2.0
1 2 B 2.0
2 3 C 1.0
3 4 A 2.0
4 5 C 1.0
5 6 C 3.0
6 7 C 8.0
7 8 C 2.0
8 9 D 0.0
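The same idea can be written with replace instead of mask; a minimal sketch, assuming numpy is imported as np:
rate = df['rate'].replace(0, np.nan)                # treat zeros as missing
means = rate.groupby(df['TYPE']).transform('mean')  # per-TYPE mean, NaNs excluded
df['rate'] = rate.fillna(means.fillna(0))           # all-NaN types (like D) fall back to 0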


How do I get the maximum value for every group and rank with all other groups?

I want to find the max score for every team and rank the teams in ascending order.
This is the dataframe:
TEAM | GROUP | SCORE
1 | A | 5
1 | B | 5
1 | C | 5
2 | D | 6
2 | A | 6
3 | D | 5
3 | A | 5
No two teams should share the same rank, so when scores tie, the team that appears first gets the lower rank and the others adjust accordingly. So the output for this is:
TEAM | GROUP | SCORE | RANK
1 | A | 5 | 1
1 | B | 5 | 1
1 | C | 5 | 1
2 | D | 6 | 3
2 | A | 6 | 3
3 | D | 5 | 2
3 | A | 5 | 2
I'm not very familiar with some python syntax but here's what I have so far:
team = df.groupby(['TEAM'])
for x in team:
    df['Rank'] = x.groupby(['TEAM'])['SCORE'].max().rank()
Try the below, which sorts on score and team, flags where the team changes, and takes a cumulative sum for the rank:
s = df[['TEAM','SCORE']].sort_values(['SCORE','TEAM'])  # order rows by score, ties broken by team
df['RANK'] = s['TEAM'].ne(s['TEAM'].shift()).cumsum()   # count team changes; assignment aligns back on the index
print(df)
TEAM GROUP SCORE RANK
0 1 A 5 1
1 1 B 5 1
2 1 C 5 1
3 2 D 6 3
4 2 A 6 3
5 3 D 5 2
6 3 A 5 2
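An alternative sketch, assuming ties should always go to the team that appears first in the frame: rank each team's maximum score with method='first' and map the result back:
ranks = df.groupby('TEAM', sort=False)['SCORE'].max().rank(method='first').astype(int)
df['RANK'] = df['TEAM'].map(ranks)
Here sort=False keeps the groups in order of first appearance, which is the order method='first' uses to break ties.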

Combine and expand Dataframe based on IDs in columns

I have 3 dataframes A,B,C:
import pandas as pd
import numpy as np

A = pd.DataFrame({"id": [1, 2],
                  "connected_to_B_id1": ["A", "B"],
                  "connected_to_B_id2": ["B", "C"],
                  "connected_to_B_id3": ["C", np.nan],
                  # an entry can have multiple ids from B
                  })
B = pd.DataFrame({"id": ["A", "B", "C"],
                  "connected_to_C_id1": [1, 1, 2],
                  "connected_to_C_id2": [2, 2, np.nan],
                  # an entry can have multiple ids from C
                  })
C = pd.DataFrame({"id": [1, 2],
                  "name": ["a", "b"],
                  })
# Output should be D:
D = pd.DataFrame({"id_A": [1, 1, 1, 1, 1, 2, 2, 2],
                  "id_B": ["A", "A", "B", "B", "C", "B", "B", "C"],
                  "id_C": [1, 2, 1, 2, 2, 1, 2, 1],
                  "name": ["a", "b", "a", "b", "b", "a", "b", "a"]
                  })
I want to use the IDs stored in the "connected_to_X" columns of each dataframe to build one dataframe that contains all relationships recorded in the three individual dataframes.
What is the most elegant way to combine A, B and C into D?
Currently I am using dicts, lists and for loops, and it's messy and complicated.
D:
|idx |id_A|id_B|id_C|name|
|---:|--:|--:|--:|--:|
| 0 | 1 | A | 1 | a |
| 1 | 1 | A | 2 | b |
| 2 | 1 | B | 1 | a |
| 3 | 1 | B | 2 | b |
| 4 | 1 | C | 2 | b |
| 5 | 2 | B | 1 | a |
| 6 | 2 | B | 2 | b |
| 7 | 2 | C | 1 | a |
You just need to unpivot A and B, then you can join the tables up.
(A
 .melt(id_vars='id')                                                        # long form: one row per A-to-B link
 .merge(B.melt(id_vars='id'), left_on='value', right_on='id', how='left')   # attach the B-to-C links
 .merge(C, left_on='value_y', right_on='id')                                # attach the names from C
 .drop(columns=['variable_x', 'variable_y', 'value_x'])
 .sort_values(['id_x', 'id_y'])
 .reset_index(drop=True)
 .reset_index()
)
index id_x id_y value_y id name
0 0 1 A 1.0 1 a
1 1 1 A 2.0 2 b
2 2 1 B 1.0 1 a
3 3 1 B 2.0 2 b
4 4 1 C 2.0 2 b
5 5 2 B 1.0 1 a
6 6 2 B 2.0 2 b
7 7 2 C 2.0 2 b
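To match D's column names exactly, one more drop/rename does it; a sketch assuming the chain above was assigned to a variable out:
D = (out.drop(columns=['index', 'value_y'])
        .rename(columns={'id_x': 'id_A', 'id_y': 'id_B', 'id': 'id_C'}))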

Dataframe: calculate difference in dates column by another column

I'm trying to calculate a running difference on the date column, driven by the event column.
That is, I want to add another column with the date difference between consecutive 1s in the event column (it contains only 0s and 1s).
So far I have come up with this half-working solution.
Dataframe:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})
Code:
x = df.loc[df['event']==1, 'date']
k = 0
for i in range(len(x)):
    df.loc[k:x.index[i], 'duration'] = x.iloc[i] - k
    k = x.index[i]
But I'm sure there is a more elegant solution.
Thanks for any advice.
Output format:
+------+-------+----------+
| date | event | duration |
+------+-------+----------+
| 1 | 0 | 3 |
| 2 | 0 | 3 |
| 3 | 1 | 3 |
| 4 | 0 | 6 |
| 5 | 0 | 6 |
| 6 | 0 | 6 |
| 7 | 0 | 6 |
| 8 | 0 | 6 |
| 9 | 1 | 6 |
| 10 | 0 | 4 |
| 11 | 0 | 4 |
| 12 | 0 | 4 |
| 13 | 1 | 4 |
| 14 | 0 | 2 |
| 15 | 1 | 2 |
+------+-------+----------+
Using your initial dataframe:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})
Add an index-like column to mark where the transitions occur (you could also base this on the date column if it is unique):
df = df.reset_index().rename(columns={'index':'idx'})  # expose the row number as a column
df.loc[df['event']==0, 'idx'] = np.nan                 # keep it only on event rows
df['idx'] = df['idx'].fillna(method='bfill')           # propagate each event's index backwards
Then, use a groupby() to count the records, and backfill them to match your structure:
df['duration'] = df.groupby('idx')['event'].count()
df['duration'] = df['duration'].fillna(method='bfill')
# Alternatively, the previous two lines can be combined as pointed out by OP
# df['duration'] = df.groupby('idx')['event'].transform('count')
df = df.drop(columns='idx')
print(df)
date event duration
0 1 0 2.0
1 2 1 2.0
2 3 0 3.0
3 4 0 3.0
4 5 1 3.0
5 6 0 5.0
6 7 0 5.0
7 8 0 5.0
8 9 0 5.0
9 10 1 5.0
10 11 0 6.0
11 12 0 6.0
12 13 0 6.0
13 14 0 6.0
14 15 0 6.0
15 16 1 6.0
16 17 0 NaN
It ends up as a float value because of the NaN in the last row. This approach works well in general if there are obvious "groups" of things to count.
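If the trailing NaN is unwanted, a small cleanup (assuming 0 is an acceptable filler for the open last segment) restores the integer dtype:
df['duration'] = df['duration'].fillna(0).astype(int)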
As an alternative, because the dates are already there as integers you can look at the differences in dates directly:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0]})
tmp = df[df['event']==1].copy()
tmp['duration'] = (tmp['date'] - tmp['date'].shift(1)).fillna(tmp['date'])
df = pd.merge(df, tmp[['date','duration']], on='date', how='left').fillna(method='bfill')
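One caveat for both approaches: on pandas 2.1+, fillna(method='bfill') emits a FutureWarning; the equivalent modern spelling is the bfill() method:
df = pd.merge(df, tmp[['date','duration']], on='date', how='left').bfill()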

Python: how do I filter data as long as a group contains any of a particular value

df = pd.DataFrame({'VisitID':[1,1,1,1,2,2,2,3,3,4,4], 'Item':['A','B','C','D','A','D','B','B','C','D','C']})
I have a dataset like this:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
4 | D |
4 | C |
I want to return all rows for a VisitID as long as that VisitID had an occurrence of item A OR B. How do I go about it? Expected result:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
In R, I can do this via
library(dplyr)
df %>% group_by(VisitID) %>% filter(any(Item %in% c('A', 'B')))
How can I perform this in Python?
Something like df.groupby(['VisitID']).query(any(['A','B']))?
The syntax is similar, just use groupby.filter:
df.groupby('VisitID').filter(lambda g: g.Item.isin(['A','B']).any())
VisitID Item
0 1 A
1 1 B
2 1 C
3 1 D
4 2 A
5 2 D
6 2 B
7 3 B
8 3 C
To extract groups that contain either item, we can just use groupby().transform('any') on isin():
s = (df.Item.isin(['A','B'])
       .groupby(df['VisitID'])
       .transform('any'))
df[s]
Output:
VisitID Item
0 1 A
1 1 B
2 1 C
3 1 D
4 2 A
5 2 D
6 2 B
7 3 B
8 3 C
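A third option avoids the Python-level callback in groupby.filter, which can be slow when there are many groups: collect the qualifying VisitIDs first, then select their rows with isin. A minimal sketch:
ids = df.loc[df['Item'].isin(['A','B']), 'VisitID'].unique()  # VisitIDs with at least one A or B
df[df['VisitID'].isin(ids)]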

Pandas: replace zero value with value of another column

How do I replace a zero value in a column with the value from the same row of another column, but only where the previous row's value in that column is also zero, i.e. only where a non-zero value has not been encountered yet?
For example: Given a dataframe with columns a, b and c:
+----+-----+-----+-----+
|    |  a  |  b  |  c  |
+----+-----+-----+-----+
|  0 |  2  |  0  |  0  |
|  1 |  5  |  0  |  0  |
|  2 |  3  |  4  |  0  |
|  3 |  2  |  0  |  3  |
|  4 |  1  |  8  |  1  |
+----+-----+-----+-----+
Replace zero values in b and c with the values of a where the previous value is zero:
+----+-----+-----+-----+
|    |  a  |  b  |  c  |
+----+-----+-----+-----+
|  0 |  2  |  2  |  2  |
|  1 |  5  |  5  |  5  |
|  2 |  3  |  4  |  3  |
|  3 |  2  |  0  |  3  |  <-- the zero in this row is not replaced because of the
|  4 |  1  |  8  |  1  |      non-zero value (4) in the row before it
+----+-----+-----+-----+
In [90]: (df[~df.apply(lambda c: c.eq(0) & c.shift().fillna(0).eq(0))]
...: .fillna(pd.DataFrame(np.tile(df.a.values[:, None], df.shape[1]),
...: columns=df.columns, index=df.index))
...: .astype(int)
...: )
Out[90]:
a b c
0 2 2 2
1 5 5 5
2 3 4 3
3 2 0 3
4 1 8 1
Explanation:
In [91]: df[~df.apply(lambda c: c.eq(0) & c.shift().fillna(0).eq(0))]
Out[91]:
a b c
0 2 NaN NaN
1 5 NaN NaN
2 3 4.0 NaN
3 2 0.0 3.0
4 1 8.0 1.0
Now we can fill the NaNs with the corresponding values from the DataFrame below (which is just column a repeated three times):
In [92]: pd.DataFrame(np.tile(df.a.values[:, None], df.shape[1]), columns=df.columns, index=df.index)
Out[92]:
a b c
0 2 2 2
1 5 5 5
2 3 3 3
3 2 2 2
4 1 1 1
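The same result can be sketched with mask, which also keeps the integer dtype since no NaNs are introduced; shift(fill_value=0) plays the role of shift().fillna(0):
# True where a cell is 0 and the cell above it in the same column is also 0
lead = df.apply(lambda c: c.eq(0) & c.shift(fill_value=0).eq(0))
out = df.mask(lead, df['a'], axis=0)  # fill those cells from column a, row-wise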
