Replace all values in a DataFrame with their count per column - python

import pandas as pd
data = ([1, 1], [2, 3], [4, 5], [1, 2], [3, 4], [3, 2])
df = pd.DataFrame(data)
print(df)
   0  1
0  1  1
1  2  3
2  4  5
3  1  2
4  3  4
5  3  2
The desired output replaces each value with the number of times it occurs in its column:
   0  1
0  2  1
1  1  1
2  1  1
3  2  2
4  2  1
5  2  2
There are two 1s in the first column, one 2, one 4, and two 3s, and so on for each column.

You can map each column (a Series s) to its value counts:
>>> df.apply(lambda s: s.map(s.value_counts()))
   0  1
0  2  1
1  1  1
2  1  1
3  2  2
4  2  1
5  2  2

One way is using transform on each column:
df.apply(lambda x: x.groupby(x).transform('count'))
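This works because grouping a Series by its own values and transforming with 'count' broadcasts each value's frequency back to every row. A quick illustration on the first column:
s = df[0]
# each row receives the number of times its value appears in the column
print(s.groupby(s).transform('count'))  # -> 2 1 1 2 2 2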
Or using Counter with replace:
from collections import Counter
df.replace(df.apply(Counter))
   0  1
0  2  1
1  1  1
2  1  1
3  2  2
4  2  1
5  2  2
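Here df.apply(Counter) builds one Counter per column, and replace treats that Series of dicts as a column-to-mapping lookup. A small check:
from collections import Counter
counters = df.apply(Counter)
print(counters[0])  # Counter({1: 2, 3: 2, 2: 1, 4: 1})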

Alternatively, you can leverage numpy.unique:
import numpy as np

def count_values(s):
    # idx maps each element to its unique value; cnt holds each value's count
    _, idx, cnt = np.unique(s, return_inverse=True, return_counts=True)
    return cnt[idx]

df.apply(count_values)
   0  1
0  2  1
1  1  1
2  1  1
3  2  2
4  2  1
5  2  2
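To unpack how count_values works, here is a standalone run on the first column: return_inverse gives each element's position in the sorted unique array, so indexing the counts with it maps every element to its own frequency.
import numpy as np

s = np.array([1, 2, 4, 1, 3, 3])
uniq, idx, cnt = np.unique(s, return_inverse=True, return_counts=True)
# uniq -> [1 2 3 4], cnt -> [2 1 2 1], idx -> [0 1 3 0 2 2]
print(cnt[idx])  # [2 1 1 2 2 2]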

Related

Pandas MultiIndex DataFrame

I have an array:
w = np.array([1, 2, 3])
and I need to create a Dataframe with a MultiIndex looking like this:
df =
     0  1  2
0 0  1  1  1
  1  1  1  1
  2  1  1  1
1 0  2  2  2
  1  2  2  2
  2  2  2  2
2 0  3  3  3
  1  3  3  3
  2  3  3  3
How can I assign the values of my array to the correct positions in the DataFrame?
The exact logic is unclear, but assuming w is the only input and you want to broadcast it across the (two-level) index and the columns:
w = np.array([1, 2, 3])
N = len(w)
df = pd.DataFrame(np.repeat(w, N**2).reshape((-1, N)),
                  index=pd.MultiIndex.from_product([np.arange(N)]*2))
Output:
     0  1  2
0 0  1  1  1
  1  1  1  1
  2  1  1  1
1 0  2  2  2
  1  2  2  2
  2  2  2  2
2 0  3  3  3
  1  3  3  3
  2  3  3  3
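An equivalent construction (a sketch under the same broadcast assumption) uses numpy broadcasting instead of repeat; each w[i] fills an N x N block:
import numpy as np
import pandas as pd

w = np.array([1, 2, 3])
N = len(w)
block = np.broadcast_to(w[:, None, None], (N, N, N)).reshape(N * N, N)
df = pd.DataFrame(block, index=pd.MultiIndex.from_product([range(N)] * 2))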

Replicating rows in a Pandas DataFrame based on a value column, then appending a counter column

Suppose I have this dataframe df:
   A  B  count
0  1  2      3
1  3  4      2
2  5  6      1
3  7  8      2
I want to replicate each row according to its count value, and then add a counter column that enumerates the copies. The resulting outcome is:
   counter  A  B  count
0        0  1  2      3
1        1  1  2      3
2        2  1  2      3
3        0  3  4      2
4        1  3  4      2
5        0  5  6      1
6        0  7  8      2
7        1  7  8      2
My idea was to duplicate the rows accordingly (using numpy and pandas), then add a counter column that increments for each repeated row and resets to 0 at every new row. But I suspect this may be slow. Is there a simpler, faster way to do it?
Let's try index.repeat to scale up the DataFrame, then groupby with cumcount to number the copies, inserting the result at the front:
df = df.loc[df.index.repeat(df['count'])]
df.insert(0, 'counter', df.groupby(level=0).cumcount())
df = df.reset_index(drop=True)
df:
   counter  A  B  count
0        0  1  2      3
1        1  1  2      3
2        2  1  2      3
3        0  3  4      2
4        1  3  4      2
5        0  5  6      1
6        0  7  8      2
7        1  7  8      2
DataFrame constructor:
import pandas as pd
df = pd.DataFrame({
    'A': [1, 3, 5, 7], 'B': [2, 4, 6, 8], 'count': [3, 2, 1, 2]
})
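If only the counter column is the concern, it can also be built directly in numpy (a sketch, equivalent to the cumcount above for this default range index):
import numpy as np

counts = df['count'].to_numpy()
# each block's starting offset, repeated per copy and subtracted from a
# global arange, yields the position within each block
starts = np.cumsum(counts) - counts
counter = np.arange(counts.sum()) - np.repeat(starts, counts)
# counter -> [0 1 2 0 1 0 0 1]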

Pandas: how to fill a sequence of rows with the previous value in a DataFrame

Suppose I have this DataFrame:
df = pd.DataFrame([[1, 1, 0, 3], [1, 1, 1, 4], [1, 1, 3, 6], [2, 1, 0, 0], [2, 1, 3, 6]],
                  columns=["id", "code", "date", "count"])
Output:
   id  code  date  count
0   1     1     0      3
1   1     1     1      4
2   1     1     3      6
3   2     1     0      0
4   2     1     3      6
I want to fill in the missing date values (here, between 0 and 3), carrying the previous count forward within each id and code group.
Intended output:
   id  code  date  count
0   1     1     0      3
1   1     1     1      4
2   1     1     2      4
3   1     1     3      6
4   2     1     0      0
5   2     1     1      0
6   2     1     2      0
7   2     1     3      6
In your case, a combination of pivot_table and stack works:
import numpy as np

(df.pivot_table(index=['id', 'code'],
                columns='date',
                values='count')
   .reindex(np.arange(4), axis=1)
   .ffill(axis=1)
   .stack()
   .reset_index(name='count')
)
Output:
   id  code  date  count
0   1     1     0    3.0
1   1     1     1    4.0
2   1     1     2    4.0
3   1     1     3    6.0
4   2     1     0    0.0
5   2     1     1    0.0
6   2     1     2    0.0
7   2     1     3    6.0
Update: if you have more than one count column, it's a bit trickier:
(df.pivot_table(index=['id', 'code'],
                columns='date')
   .stack(level=0)
   .reindex(np.arange(4), axis=1)
   .ffill(axis=1)
   .unstack(level=-1)
   .stack(level=0)
   .reset_index()
)
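An alternative that avoids pivoting altogether (a sketch, assuming the dates to fill always run 0 through 3): reindex each group's count series on the full date range and forward-fill:
out = (df.set_index('date')
         .groupby(['id', 'code'])['count']
         .apply(lambda s: s.reindex(np.arange(4)).ffill())
         .reset_index())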

Pandas Dataframe groupby: apply several lambda functions at once

I group the following pandas dataframe by 'name' and then apply several lambda functions on 'value' to generate additional columns.
Is it possible to apply these lambda functions at once, to increase efficiency?
import pandas as pd
df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, 3, 1, 2, 3, 1, 2, 3, 3]})
df['Diff'] = df.groupby('name')['value'].transform(lambda x: x - x.iloc[0])
df['Count'] = df.groupby('name')['value'].transform(lambda x: x.count())
df['Index'] = df.groupby('name')['value'].transform(lambda x: x.index - x.index[0] + 1)
print(df)
Output:
  name  value  Diff  Count  Index
0    A      1     0      2      1
1    A      3     2      2      2
2    B      1     0      4      1
3    B      2     1      4      2
4    B      3     2      4      3
5    B      1     0      4      4
6    C      2     0      3      1
7    C      3     1      3      2
8    C      3     1      3      3
It's possible to use GroupBy.apply with a single function, though I'm not sure the performance is any better:
def f(x):
    a = x - x.iloc[0]
    b = x.count()
    c = x.index - x.index[0] + 1
    return pd.DataFrame({'Diff': a, 'Count': b, 'Index': c})

df = df.join(df.groupby('name')['value'].apply(f))
print(df)
  name  value  Diff  Count  Index
0    A      1     0      2      1
1    A      3     2      2      2
2    B      1     0      4      1
3    B      2     1      4      2
4    B      3     2      4      3
5    B      1     0      4      4
6    C      2     0      3      1
7    C      3     1      3      2
8    C      3     1      3      3
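Another option (a sketch) is to build the group object once and reuse it; transform('first') and cumcount reproduce the three lambdas here because each group's rows are consecutive in the index:
g = df.groupby('name')['value']
df['Diff'] = df['value'] - g.transform('first')
df['Count'] = g.transform('count')
df['Index'] = g.cumcount() + 1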

pandas aggregate data while carrying a column unchanged

I have a data frame, a:
a = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 4], 'B': [1, 5, 3, 2, 4, 1], 'C': [1, 4, 3, 6, 1, 1]})

   ID  B  C
0   1  1  1
1   1  5  4
2   2  3  3
3   2  2  6
4   3  4  1
5   4  1  1
I want to aggregate it so that the result is grouped by ID and keeps the row corresponding to the minimum of B (i.e., apply min() to B and carry C along unchanged).
So the resulting data frame should be:
   ID  B  C
0   1  1  1
1   2  2  6
2   3  4  1
3   4  1  1
How can I do this programmatically using pandas.groupby(), or is there another way to do it?
You can use groupby and transform to filter rows:
a.loc[a['B'] == a.groupby('ID').B.transform('min')]
   B  C  ID
0  1  1   1
3  2  6   2
4  4  1   3
5  1  1   4
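A related idiom (a sketch): groupby idxmin selects the index label of each group's minimum, and loc pulls those rows. Unlike the boolean filter above, which keeps every row tied for the minimum, this keeps exactly one row per ID:
a.loc[a.groupby('ID')['B'].idxmin()]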
Try sorting before your groupby, then taking first:
a.sort_values('B').groupby('ID',as_index=False).first()
   ID  B  C
0   1  1  1
1   2  2  6
2   3  4  1
3   4  1  1
Or, probably a faster way to do it is to sort by ID and B and then drop duplicate IDs, keeping the first (which is the default behavior of drop_duplicates):
a.sort_values(['ID','B']).drop_duplicates('ID')
   ID  B  C
0   1  1  1
1   2  2  6
2   3  4  1
3   4  1  1
When there is sorting involved, and the grouping doesn't involve any calculations, I prefer to work on the underlying numpy arrays for performance.
Using argsort and numpy.unique:
arr = a.values
# sort the rows by column B (position 1)
out = arr[np.argsort(arr[:, 1])]
# the first occurrence of each ID in the sorted array is its minimum-B row
_, idx = np.unique(out[:, 0], return_index=True)
out[idx]
array([[1, 1, 1],
       [2, 2, 6],
       [3, 4, 1],
       [4, 1, 1]], dtype=int64)
To reassign the values to your DataFrame:
pd.DataFrame(out[idx], columns=a.columns)
   ID  B  C
0   1  1  1
1   2  2  6
2   3  4  1
3   4  1  1
