groupby to get the average, using dynamic condition - python

I have been searching for how to use groupby with conditions and found many posts about it, for example: Pandas: conditional group-specific computations
However, I couldn't find one where the condition depends on the current row itself. I'd like to get the average (or count, or any other aggregate), but with the dataset filtered by a dynamic condition.
To illustrate this, this is the summarized dataset:
ID | Seq | Total
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
2 | 1 | 1
2 | 2 | 2
2 | 3 | 1
I want the mean grouped by ID, with the additional condition that for each record only the rows of the same group whose Seq is smaller than or equal to that record's Seq are included. This should be the result:
ID | Seq | Total | x
1 | 1 | 1 | 1 <-- mean of 1
1 | 2 | 2 | 1.5 <-- mean of 1 and 2
1 | 3 | 3 | 2 <-- mean of 1,2 and 3
2 | 1 | 1 | 1 <-- mean of 1
2 | 2 | 2 | 1.5 <-- mean of 1 and 2
2 | 3 | 1 | 1.33 <-- mean of 1, 2 and 1
Any help will be appreciated!

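It looks like you are just trying to get the expanding().mean() of the ID-grouped Total column. Note that assigning .values is positional, so this assumes the rows are already sorted by ID (and by Seq within each ID), e.g.: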
In []:
df['x'] = df.groupby('ID')['Total'].expanding().mean().values
df
Out[]:
ID Seq Total x
0 1 1 1 1.000000
1 1 2 2 1.500000
2 1 3 3 2.000000
3 2 1 1 1.000000
4 2 2 2 1.500000
5 2 3 1 1.333333
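If the rows are not already sorted, a variant that aligns on the index instead of on position may be safer. This is a minimal sketch, assuming the column names from the question; the shuffled sample frame and the name ordered are made up for illustration:

```python
import pandas as pd

# Same data as in the question, but with the rows deliberately out of order.
df = pd.DataFrame({
    "ID":    [2, 1, 1, 2, 1, 2],
    "Seq":   [1, 1, 2, 2, 3, 3],
    "Total": [1, 1, 2, 2, 3, 1],
})

# Sort so that the expanding window follows Seq within each ID.
ordered = df.sort_values(["ID", "Seq"])

# The result is indexed by (ID, original row label); dropping the ID level
# leaves the original labels, so the assignment aligns row by row.
expanding_mean = ordered.groupby("ID")["Total"].expanding().mean()
df["x"] = expanding_mean.reset_index(level=0, drop=True)

print(df)
```

Because the assignment aligns on the row labels, the original row order of df is left untouched.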

Related

How do I get the maximum value for every group and rank with all other groups?

I want to find the max value for every team and rank the teams in ascending order.
This is the dataframe:
TEAM | GROUP | SCORE
1 | A | 5
1 | B | 5
1 | C | 5
2 | D | 6
2 | A | 6
3 | D | 5
3 | A | 5
No two teams should share a rank, so when scores are tied the team that shows up first gets the lower rank and the others adjust accordingly. So the output for this is:
TEAM | GROUP | SCORE | RANK
1 | A | 5 | 1
1 | B | 5 | 1
1 | C | 5 | 1
2 | D | 6 | 3
2 | A | 6 | 3
3 | D | 5 | 2
3 | A | 5 | 2
I'm not very familiar with some python syntax but here's what I have so far:
team = df.groupby(['TEAM'])
for x in team:
    df['Rank'] = x.groupby(['TEAM'])['SCORE'].max().rank()
Try the following, which sorts by score and team, flags the rows where the team changes, and takes the cumulative sum of those flags as the rank:
s = df[['TEAM','SCORE']].sort_values(['SCORE','TEAM'])
df['RANK'] = s['TEAM'].ne(s['TEAM'].shift()).cumsum()
print(df)
TEAM GROUP SCORE RANK
0 1 A 5 1
1 1 B 5 1
2 1 C 5 1
3 2 D 6 3
4 2 A 6 3
5 3 D 5 2
6 3 A 5 2
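The sort-and-cumsum approach relies on every row of a team carrying the same score. A possible alternative, closer to the attempt in the question, is to rank the per-team maximum and map it back onto the rows. This is only a sketch, assuming ties should be broken by order of first appearance; the names team_max and team_rank are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "TEAM":  [1, 1, 1, 2, 2, 3, 3],
    "GROUP": ["A", "B", "C", "D", "A", "D", "A"],
    "SCORE": [5, 5, 5, 6, 6, 5, 5],
})

# One maximum per team; sort=False keeps teams in order of first appearance.
team_max = df.groupby("TEAM", sort=False)["SCORE"].max()

# method="first" gives tied teams distinct ranks in the order they appear.
team_rank = team_max.rank(method="first").astype(int)

# Broadcast the per-team rank back to every row of that team.
df["RANK"] = df["TEAM"].map(team_rank)

print(df)
```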

Python pandas count occurrences in each column

I am new to pandas. Can someone help me calculate the frequencies of values for each column?
Dataframe:
id|flag1|flag2|flag3|
---------------------
1 | 1 | 2 | 1 |
2 | 3 | 1 | 1 |
3 | 3 | 4 | 4 |
4 | 4 | 1 | 4 |
5 | 2 | 3 | 2 |
I want something like
id|flag1|flag2|flag3|
---------------------
1 | 1 | 2 | 2 |
2 | 1 | 1 | 1 |
3 | 2 | 1 | 0 |
4 | 1 | 1 | 2 |
Explanation: the value 1 occurs once in flag1, twice in flag2 and twice in flag3.
First select only the flag columns, either with filter or by dropping the id column, then apply value_counts to each column, and finally replace the NaNs with 0 and cast to int:
df = df.filter(like='flag').apply(lambda x: x.value_counts()).fillna(0).astype(int)
print (df)
flag1 flag2 flag3
1 1 2 2
2 1 1 1
3 2 1 0
4 1 1 2
Or:
df = df.drop(columns='id').apply(lambda x: x.value_counts()).fillna(0).astype(int)
print (df)
flag1 flag2 flag3
1 1 2 2
2 1 1 1
3 2 1 0
4 1 1 2
Thank you, Bharath, for the suggestion (pass the function itself, without calling it):
df = df.filter(like='flag').apply(pd.Series.value_counts).fillna(0).astype(int)
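Another way to get the same table, without the fillna step, might be to reshape the flag columns to long form and cross-tabulate. This is a sketch rather than a recommendation; the names long and counts are made up for illustration, and the variable and value column names are the defaults produced by melt:

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 2, 3, 4, 5],
    "flag1": [1, 3, 3, 4, 2],
    "flag2": [2, 1, 4, 1, 3],
    "flag3": [1, 1, 4, 4, 2],
})

# Long form: one row per (column name, cell value) pair.
long = df.filter(like="flag").melt()

# Count how often each value occurs in each column; absent pairs become 0.
counts = pd.crosstab(long["value"], long["variable"])

print(counts)
```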

Shuffling Several DataFrames Together

Is it possible to shuffle several DataFrames together?
For example I have a DataFrame df1 and a DataFrame df2. I want to shuffle the rows randomly, but for both DataFrames in the same way.
Example
df1:
|___|_______|
| 1 | ... |
| 2 | ... |
| 3 | ... |
| 4 | ... |
df2:
|___|_______|
| 1 | ... |
| 2 | ... |
| 3 | ... |
| 4 | ... |
After shuffling a possible order for both DataFrames could be:
|___|_______|
| 2 | ... |
| 3 | ... |
| 4 | ... |
| 1 | ... |
I think you can reindex both DataFrames with the same numpy.random.permutation of the index, but both DataFrames must have the same length and the same unique index values:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'a':range(5)})
print (df1)
a
0 0
1 1
2 2
3 3
4 4
df2 = pd.DataFrame({'a':range(5)})
print (df2)
a
0 0
1 1
2 2
3 3
4 4
idx = np.random.permutation(df1.index)
print (df1.reindex(idx))
a
2 2
4 4
1 1
3 3
0 0
print (df2.reindex(idx))
a
2 2
4 4
1 1
3 3
0 0
Alternative with loc (reindex_axis, which older answers used here, was removed in pandas 1.0):
print (df1.loc[idx])
print (df2.loc[idx])
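If the two DataFrames have the same length but different index labels, shuffling by position rather than by label may be an option. This is a minimal sketch; the sample frames, the seed and the names rng, perm, shuffled1 and shuffled2 are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Two frames of equal length whose index labels do not match.
df1 = pd.DataFrame({"a": range(5)}, index=list("abcde"))
df2 = pd.DataFrame({"b": range(10, 15)}, index=range(100, 105))

# One permutation of row positions, applied to both frames.
rng = np.random.default_rng(0)  # fixed seed only to make the run repeatable
perm = rng.permutation(len(df1))

shuffled1 = df1.iloc[perm]
shuffled2 = df2.iloc[perm]

print(shuffled1)
print(shuffled2)
```

Rows that were at the same position before the shuffle stay paired afterwards, whatever their labels are.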

Pandas: replace zero value with value of another column

How can I replace a zero in a column with the value from the same row of another column, but only where the previous row's value in that column is also zero, i.e. only where a non-zero value has not been encountered yet?
For example: Given a dataframe with columns a, b and c:
+----+-----+-----+----+
| | a | b | c |
|----+-----+-----|----|
| 0 | 2 | 0 | 0 |
| 1 | 5 | 0 | 0 |
| 2 | 3 | 4 | 0 |
| 3 | 2 | 0 | 3 |
| 4 | 1 | 8 | 1 |
+----+-----+-----+----+
Replace zero values in b and c with the values of a where the previous value is zero:
+----+-----+-----+----+
| | a | b | c |
|----+-----+-----|----|
| 0 | 2 | 2 | 2 |
| 1 | 5 | 5 | 5 |
| 2 | 3 | 4 | 3 |
| 3 | 2 | 0 | 3 | <-- zero in this row is not replaced because of
| 4 | 1 | 8 | 1 | non-zero value (4) in row before it.
+----+-----+-----+----+
In [90]: (df[~df.apply(lambda c: c.eq(0) & c.shift().fillna(0).eq(0))]
...: .fillna(pd.DataFrame(np.tile(df.a.values[:, None], df.shape[1]),
...: columns=df.columns, index=df.index))
...: .astype(int)
...: )
Out[90]:
a b c
0 2 2 2
1 5 5 5
2 3 4 3
3 2 0 3
4 1 8 1
Explanation:
In [91]: df[~df.apply(lambda c: c.eq(0) & c.shift().fillna(0).eq(0))]
Out[91]:
a b c
0 2 NaN NaN
1 5 NaN NaN
2 3 4.0 NaN
3 2 0.0 3.0
4 1 8.0 1.0
Now we can fill the NaNs with the corresponding values from the DataFrame below (built by repeating column a three times):
In [92]: pd.DataFrame(np.tile(df.a.values[:, None], df.shape[1]), columns=df.columns, index=df.index)
Out[92]:
a b c
0 2 2 2
1 5 5 5
2 3 3 3
3 2 2 2
4 1 1 1
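If the intent is strictly "replace only while no non-zero value has been seen yet", a running flag per column may be easier to read. Note this is a slightly different rule from the shift-based answer: for a column like 0, 4, 0, 0 the shift-based version would replace the last zero, while this sketch would not. The name seen is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [2, 5, 3, 2, 1],
    "b": [0, 0, 4, 0, 8],
    "c": [0, 0, 0, 3, 1],
})

for col in ["b", "c"]:
    # True from the first non-zero value onwards, False for the leading zeros.
    seen = df[col].ne(0).cumsum().gt(0)
    # Keep the value once a non-zero has been seen, otherwise take it from a.
    df[col] = df[col].where(seen, df["a"])

print(df)
```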

Interleaving Pandas Dataframes by Timestamp

I've got 2 Pandas DataFrames, each containing 2 columns. One of the columns is a timestamp column [t], the other one contains sensor readings [s].
I now want to create a single DataFrame, containing 4 columns, that is interleaved on the timestamp column.
Example:
First Dataframe:
+----+----+
| t1 | s1 |
+----+----+
| 0 | 1 |
| 2 | 3 |
| 3 | 3 |
| 5 | 2 |
+----+----+
Second DataFrame:
+----+----+
| t2 | s2 |
+----+----+
| 1 | 5 |
| 2 | 3 |
| 4 | 3 |
+----+----+
Target:
+----+----+----+----+
| t1 | t2 | s1 | s2 |
+----+----+----+----+
| 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 5 |
| 2 | 1 | 3 | 5 |
| 2 | 2 | 3 | 3 |
| 3 | 2 | 3 | 3 |
| 3 | 4 | 3 | 3 |
| 5 | 4 | 2 | 3 |
+----+----+----+----+
I had a look at pandas.merge, but that left me with a lot of NaNs and an unsorted table.
a.merge(b, how='outer')
Out[55]:
t1 s1 t2 s2
0 0 1 NaN NaN
1 2 3 2 3
2 3 3 NaN NaN
3 5 2 NaN NaN
4 1 NaN 1 5
5 4 NaN 4 3
Merging puts NaNs in the columns that come from one dataframe wherever the key value is missing from the other dataframe. It does not create data that is not present in the dataframes being merged.
For example, index 0 in your target dataframe shows t2 with a value of 0. This value is not present in the second dataframe, so you cannot expect it to appear in the merged dataframe either. The same applies to other rows as well.
What you can do instead is reindex the dataframes to a common index. In your case, since the maximum timestamp in the target dataframe is 5, let's use this list to reindex both input dataframes:
In [382]: ind
Out[382]: [0, 1, 2, 3, 4, 5]
Now we reindex both inputs to this index:
In [372]: x = a.set_index('t1').reindex(ind).fillna(0).reset_index()
In [373]: x
Out[373]:
t1 s1
0 0 1
1 1 0
2 2 3
3 3 3
4 4 0
5 5 2
In [374]: y = b.set_index('t2').reindex(ind).fillna(0).reset_index()
In [375]: y
Out[375]:
t2 s2
0 0 0
1 1 5
2 2 3
3 3 0
4 4 3
5 5 0
Now we merge them to get something close to the target dataframe:
In [376]: x.merge(y, left_on=['t1'], right_on=['t2'], how='outer')
Out[376]:
t1 s1 t2 s2
0 0 1 0 0
1 1 0 1 5
2 2 3 2 3
3 3 3 3 0
4 4 0 4 3
5 5 2 5 0
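If the target table is read as "one row per reading, carrying forward the last known value of the other sensor", a sort plus forward fill may get closer than reindexing. This is a minimal sketch under that assumption; the helper column t and the names events and result are made up for illustration, and on equal timestamps the first dataframe's row is assumed to come first:

```python
import pandas as pd

a = pd.DataFrame({"t1": [0, 2, 3, 5], "s1": [1, 3, 3, 2]})
b = pd.DataFrame({"t2": [1, 2, 4], "s2": [5, 3, 3]})

# Stack both frames and give every row a common timestamp column t.
events = pd.concat([a.assign(t=a["t1"]), b.assign(t=b["t2"])],
                   ignore_index=True)

# A stable sort keeps rows from a ahead of rows from b on equal timestamps.
events = events.sort_values("t", kind="mergesort")

# Carry the last known reading forward; anything before the first reading is 0.
result = (events.drop(columns="t")
                .ffill()
                .fillna(0)
                .astype(int)[["t1", "t2", "s1", "s2"]]
                .reset_index(drop=True))

print(result)
```

With the sample data from the question this gives seven rows, one per reading, in timestamp order.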
