How can I count the unique values (rows) in a pandas DataFrame?

I have a pandas DataFrame that looks like Y =
0 1 2 3
0 1 1 0 0
1 0 0 0 0
2 1 1 1 0
3 1 1 0 0
4 1 1 0 0
5 1 1 0 0
6 1 0 0 0
7 1 1 1 0
8 1 0 0 0
... .. .. .. ..
14989 1 1 1 1
14990 1 1 1 0
14991 1 1 1 1
14992 1 1 1 0
[14993 rows x 4 columns]
There are a total of 5 unique values:
1 1 0 0
0 0 0 0
1 1 1 0
1 0 0 0
1 1 1 1
For each unique value, I want to count how many times it appears in the Y DataFrame.

Let us use np.unique:
c, v = np.unique(df.values, axis=0, return_counts=True)
c
array([[0, 0, 0, 0],
[1, 0, 0, 0],
[1, 1, 0, 0],
[1, 1, 1, 0]], dtype=int64)
v
array([1, 2, 4, 2], dtype=int64)
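For readability, the unique rows and their counts can be paired in a single frame. A minimal sketch (judging by the counts, which sum to 9, the arrays above were computed from just the nine-row head of the sample, which is why only four of the five unique rows appear):
import numpy as np
import pandas as pd

# Hypothetical stand-in for the head of Y
df = pd.DataFrame([[1, 1, 0, 0], [0, 0, 0, 0], [1, 1, 1, 0], [1, 1, 0, 0]])
c, v = np.unique(df.values, axis=0, return_counts=True)

# Pair each unique row with its count in one readable frame
counts = pd.DataFrame(c, columns=df.columns)
counts['count'] = v
print(counts)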

We can use .groupby for this to get the unique combinations.
While applying groupby, we take the size of each group as the count.
# Group by all columns and take the size of each group
df_group = df.groupby(list(df.columns)).size().reset_index()
# Because we used reset_index, the count column is named 0; rename it
df_group.rename({0: 'count'}, inplace=True, axis=1)
Output
0 1 2 3 count
0 0 0 0 0 1
1 1 0 0 0 2
2 1 1 0 0 4
3 1 1 1 0 4
4 1 1 1 1 2
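As an aside, on pandas 1.1 or newer, DataFrame.value_counts counts unique rows in one step (note it sorts by count rather than by the row values):
# Counts each unique row directly; the result is sorted by count, descending
df.value_counts().reset_index(name='count')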
Note: I copied the example DataFrame you provided, which looks like this:
print(df)
0 1 2 3
0 1 1 0 0
1 0 0 0 0
2 1 1 1 0
3 1 1 0 0
4 1 1 0 0
5 1 1 0 0
6 1 0 0 0
7 1 1 1 0
8 1 0 0 0
14989 1 1 1 1
14990 1 1 1 0
14991 1 1 1 1
14992 1 1 1 0

I made a sample for you:
import itertools
import random

import pandas as pd

iter_list = list(itertools.product([0, 1], [0, 1], [0, 1], [0, 1]))
sum_list = []
for i in range(1000):
    sum_list.append(random.choice(iter_list))
target_df = pd.DataFrame(sum_list)
target_df.reset_index().groupby(list(target_df.columns)).count().rename(columns={'index': 'count'}).reset_index()

Related

Is it possible to create multiple columns that are based on each other?

Let's say I have this DF:
Range = np.arange(0,10,1)
Table = pd.DataFrame({"Row": Range})
Now I'd like to add 2 new columns that are based on each other -
For example -
Table["A"]=np.where(Table["B"].shift(1)>0,1,0)
Table["B"]=np.where(Table["A"]==1,1,0)
In Excel, for example, such a circular reference doesn't raise an error.
I wonder if that is possible on a DataFrame in Python.
Thanks.
You can try an iterative approach: initialize B to something (e.g., all items equal to None, 1, or 0, or some other initialization pattern), set A as in your question, then loop, updating A and B until both converge or a maximum iteration threshold is reached:
import pandas as pd
import numpy as np

Range = np.arange(0, 4, 1)
Table = pd.DataFrame({"Row": Range})
print(Table)

for pair in [(np.nan, np.nan), (None, None), (0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f'\npair is {pair}')
    Table["B"] = np.where(Table["Row"] % 2, pair[0], pair[1])
    Table["A"] = np.where(Table["B"].shift(1) > 0, 1, 0)
    i = 0
    prevA, prevB = None, None
    while prevA is None or not prevB.equals(Table["B"]) or not prevA.equals(Table["A"]):
        if i >= 20:
            print('max iterations exceeded')
            break
        print(f'iteration {i}')
        print(Table)
        i += 1
        prevA, prevB = Table["A"].copy(), Table["B"].copy()
        Table["A"] = np.where(Table["B"].shift(1) > 0, 1, 0)
        Table["B"] = np.where(Table["A"] == 1, 1, 0)
Note that with the example logic from OP, this does indeed seem to converge for a few initialization patterns for B (at least it does using a shorter input array of length 4).
Sample output:
pair is (nan, nan)
iteration 0
Row B A
0 0 NaN 0
1 1 NaN 0
2 2 NaN 0
3 3 NaN 0
iteration 1
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
pair is (None, None)
iteration 0
Row B A
0 0 None 0
1 1 None 0
2 2 None 0
3 3 None 0
iteration 1
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
pair is (0, 0)
iteration 0
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
pair is (0, 1)
iteration 0
Row B A
0 0 1 0
1 1 0 1
2 2 1 0
3 3 0 1
iteration 1
Row B A
0 0 0 0
1 1 1 1
2 2 0 0
3 3 1 1
iteration 2
Row B A
0 0 0 0
1 1 0 0
2 2 1 1
3 3 0 0
iteration 3
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 1 1
iteration 4
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
pair is (1, 0)
iteration 0
Row B A
0 0 0 0
1 1 1 0
2 2 0 1
3 3 1 0
iteration 1
Row B A
0 0 0 0
1 1 0 0
2 2 1 1
3 3 0 0
iteration 2
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 1 1
iteration 3
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
pair is (1, 1)
iteration 0
Row B A
0 0 1 0
1 1 1 1
2 2 1 1
3 3 1 1
iteration 1
Row B A
0 0 0 0
1 1 1 1
2 2 1 1
3 3 1 1
iteration 2
Row B A
0 0 0 0
1 1 0 0
2 2 1 1
3 3 1 1
iteration 3
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 1 1
iteration 4
Row B A
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0

Pandas resetting cumsum() based on a condition of another column

I have a column called 'on' with a series of 0 and 1:
d1 = {'on': [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0]}
df = pd.DataFrame(d1)
I want to create a new column called 'value' that does a cumulative count (cumsum()) only while the 'on' column is 1, and restarts from zero once the 'on' column shows a 0.
I tried using a combination of cumsum() and np.where but I don't get what I want as follows:
df['value_try'] = df['on'].cumsum()
df['value_try'] = np.where(df['on'] == 0, 0, df['value_try'])
Attempt:
on value_try
0 0 0
1 0 0
2 0 0
3 1 1
4 1 2
5 1 3
6 0 0
7 0 0
8 1 4
9 1 5
10 0 0
What my desired output would be:
on value
0 0 0
1 0 0
2 0 0
3 1 1
4 1 2
5 1 3
6 0 0
7 0 0
8 1 1
9 1 2
10 0 0
You can define groups of consecutive 0s or 1s by checking whether each value of on equals that of the previous row (via .shift()) and taking Series.cumsum() to get a group number. Then use GroupBy.cumsum() within each group to get the value.
g = df['on'].ne(df['on'].shift()).cumsum()
df['value'] = df.groupby(g)['on'].cumsum()
Result:
print(df)
on value
0 0 0
1 0 0
2 0 0
3 1 1
4 1 2
5 1 3
6 0 0
7 0 0
8 1 1
9 1 2
10 0 0
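For intuition, the intermediate grouper g simply labels each run of consecutive equal values in 'on' (values derived by hand from the sample input):
print(g.tolist())
# on: 0 0 0 1 1 1 0 0 1 1 0
# g:  1 1 1 2 2 2 3 3 4 4 5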
Let us try cumcount + cumsum
df['out'] = df.groupby(df['on'].eq(0).cumsum()).cumcount()
Out[18]:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 0
8 1
9 2
10 0
dtype: int64
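For intuition, df['on'].eq(0).cumsum() starts a new group at every 0, so each group holds one 0 followed by its run of 1s; cumcount then numbers the rows within each group from 0 (values derived by hand from the sample input):
print(df['on'].eq(0).cumsum().tolist())
# [1, 2, 3, 3, 3, 3, 4, 5, 5, 5, 6]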

How to use cell values as new columns in a pandas DataFrame

I have a dataframe like the following:
Labels
1 Nail_Polish,Nails
2 Nail_Polish,Nails
3 Foot_Care,Targeted_Body_Care
4 Foot_Care,Targeted_Body_Care,Skin_Care
I want to generate the following matrix:
Nail_Polish Nails Foot_Care Targeted_Body_Care Skin_Care
1 1 1 0 0 0
2 1 1 0 0 0
3 0 0 1 1 0
4 0 0 1 1 1
How can I achieve this?
Use str.get_dummies:
df2 = df['Labels'].str.get_dummies(sep=',')
The resulting output:
Foot_Care Nail_Polish Nails Skin_Care Targeted_Body_Care
1 0 1 1 0 0
2 0 1 1 0 0
3 1 0 0 0 1
4 1 0 0 1 1
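Note that str.get_dummies sorts the new columns alphabetically. If the matrix should instead keep the labels in first-appearance order, one possible sketch is to reindex the result:
# Reorder the dummy columns by each label's first appearance in the data
order = pd.unique(df['Labels'].str.split(',').explode())
df2 = df2[order]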

How to get DataFrame indices for multiple rows configurations using Pandas in Python?

Consider the DataFrame P1 and P2:
P1 =
A B
0 0 0
1 0 1
2 1 0
3 1 1
P2 =
A B C
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I would like to know if there is a concise and efficient way of getting the indices in P1 for the rows (tuples/configurations/assignments) of columns ['A','B'] in P2.
That is, given P2[['A','B']]:
P2[['A','B']] =
A B
0 0 0
1 0 0
2 0 1
3 0 1
4 1 0
5 1 0
6 1 1
7 1 1
I would like to get [0, 0, 1, 1, 2, 2, 3, 3], since the first and second rows in P2[['A','B']] correspond to the first row in P1, and so on.
You could use merge against P1 with its index reset, then extract the matched index column:
In [3]: tmp = p2[['A', 'B']].merge(p1.reset_index())
In [4]: tmp
Out[4]:
A B index
0 0 0 0
1 0 0 0
2 0 1 1
3 0 1 1
4 1 0 2
5 1 0 2
6 1 1 3
7 1 1 3
Get the values.
In [5]: tmp['index'].values
Out[5]: array([0, 0, 1, 1, 2, 2, 3, 3], dtype=int64)
However, there could be a native NumPy method to do this as well.
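One native NumPy possibility, sketched under the assumption that the columns are binary as in the example: encode each row as an integer key, then locate P2's keys inside P1's keys with np.searchsorted.
import numpy as np

# Encode each binary row as an integer (here A + 2*B)
weights = 1 << np.arange(p1.shape[1])
keys1 = p1.values @ weights
keys2 = p2[['A', 'B']].values @ weights

# For each key in keys2, find its position in keys1
order = np.argsort(keys1)
idx = order[np.searchsorted(keys1, keys2, sorter=order)]
# idx -> array([0, 0, 1, 1, 2, 2, 3, 3])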

Counting number of zeros per row by Pandas DataFrame?

Given a DataFrame, I would like to compute the number of zeros in each row. How can I compute it with Pandas?
This is presently what I've done; it returns a boolean mask of the zeros rather than a count:
def is_blank(x):
    return x == 0

indexer = train_df.applymap(is_blank)
Use a boolean comparison, which will produce a boolean df. We can then cast this to int (True becomes 1, False becomes 0) and call sum with param axis=1 to count row-wise:
In [56]:
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
df
Out[56]:
a b c
0 1 0 0
1 0 0 0
2 0 1 0
3 1 0 0
4 3 1 0
In [64]:
(df == 0).astype(int).sum(axis=1)
Out[64]:
0 2
1 3
2 2
3 2
4 1
dtype: int64
Breaking the above down:
In [65]:
(df == 0)
Out[65]:
a b c
0 False True True
1 True True True
2 True False True
3 False True True
4 False False True
In [66]:
(df == 0).astype(int)
Out[66]:
a b c
0 0 1 1
1 1 1 1
2 1 0 1
3 0 1 1
4 0 0 1
EDIT
As pointed out by David, the astype to int is unnecessary, since the boolean values will be upcast to int when calling sum, so this simplifies to:
(df == 0).sum(axis=1)
You can count the zeros per row using the following pandas one-liner (axis=1 sums across the columns of each row).
It may help someone who needs to count occurrences of a particular value in each row:
df.isin([0]).sum(axis=1)
Here df is the DataFrame and the value we want to count is 0.
Here is another solution using apply() and value_counts().
df = pd.DataFrame({'a': [1, 0, 0, 1, 3], 'b': [0, 0, 1, 0, 1], 'c': [0, 0, 0, 0, 0]})
df.apply(lambda s: s.value_counts().get(key=0, default=0), axis=1)
Given the following dataframe df
df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 0, 1, 0, 0, 0],
'B': [0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
'C': [1, 1, 1, 0, 0, 1, 0, 0, 0, 0],
'D': [0, 0, 0, 0, 0, 0, 1, 0, 1, 0],
'E': [0, 0, 1, 0, 1, 0, 0, 1, 0, 1]})
[Out]:
A B C D E
0 1 0 1 0 0
1 1 0 1 0 0
2 1 0 1 0 1
3 1 0 0 0 0
4 1 1 0 0 1
5 0 0 1 0 0
6 1 0 0 1 0
7 0 0 0 0 1
8 0 0 0 1 0
9 0 1 0 0 1
Apart from the various answers mentioned before, if the requirement is to use just pandas, another option is pandas.DataFrame.eq:
df['Zero_Count'] = df.eq(0).sum(axis=1)
[Out]:
A B C D E Zero_Count
0 1 0 1 0 0 3
1 1 0 1 0 0 3
2 1 0 1 0 1 2
3 1 0 0 0 0 4
4 1 1 0 0 1 2
5 0 0 1 0 0 4
6 1 0 0 1 0 3
7 0 0 0 0 1 4
8 0 0 0 1 0 4
9 0 1 0 0 1 3
However, one can also do it with numpy using numpy.sum
import numpy as np
df['Zero_Count'] = np.sum(df == 0, axis=1)
[Out]: identical to the Zero_Count table above
Or even using numpy.count_nonzero as follows
df['Zero_Count'] = np.count_nonzero(df == 0, axis=1)
[Out]: identical to the Zero_Count table above
