I wanted to see how to express counts of unique values of column B for each unique value in column A where corresponding column C >0.
df:
A B C
1 10 0
1 12 3
2 3 1
I tried this but its missing the where clause to filter for C>0. How do I add it?
df.groupby(['A'])['B'].apply(lambda b : b.astype(int).nunique())
Let's first start by creating the dataframe that OP mentions in the question
import pandas as pd
df = pd.DataFrame({'A': [1,1,2], 'B': [10,12,3], 'C': [0,3,1]})
Now, in order to achieve what OP wants, there are various options. One of the ways is by selecting the df where column C is greater than 0, then use pandas.DataFrame.groupby to group by column A. Finally, use pandas.DataFrame.nunique to count the unique values of column B. In one line it would look like the following
count = df[df['C'] > 0].groupby('A')['B'].nunique()
[Out]:
A
1 1
2 1
If one wants to sum every number in of unique items that satisfy the condition, from the dataframe count, one can simply do
count = count.sum()
[Out]:
2
Assuming one wants to do everything in one line, one can use pandas.DataFrame.sum as
count = df[df['C'] > 0].groupby('A')['B'].nunique().sum()
[Out]:
2
Related
I'm trying to find the duplicates between 2 columns, were order is independent, but I need to keep the count of duplicates after dropping them
df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
columns=['source', 'target'],
)
This is my expected result
source target count
0 A B 2
1 D B 1
3 B C 2
I've already tried several approaches, but I can't come close to a solution.
It does not matter which combination is maintained. In the result example I kept the first.
The following approach creates a new column containing a set of the values in the columns specified. The advantage is that all other columns are preserved in the final result. Furthermore, the indices are preserved the same way as in the expected output you posted:
df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
columns=['source', 'target'],)
# Create column with set of both columns
df['tmp'] = df.apply(lambda x: frozenset([x['source'], x['target']]), axis=1)
# Create count column based on new tmp column
df['count'] = df.groupby(['tmp'])['target'].transform('size')
# Drop duplicate rows based on new tmp column
df = df[~df.duplicated(subset='tmp', keep='first')]
# Remove tmp column
df = df.drop('tmp', 1)
df
Output:
source target count
0 A B 2
1 D B 1
3 B C 2
You can use df.duplicated() to see which ones are duplicated, the output is true if item is duplicated and false if it isn’t. For more infos and practical example check out the documentation
Create a summary based on applying a frozenset to your desired columns. Here we're using all columns.
summary = df.apply(frozenset, axis=1).value_counts()
This'll give you a Series of:
(A, B) 2
(C, B) 2
(B, D) 1
dtype: int64
You can then reconstruct a DataFrame by iterating over that series, eg:
df2 = pd.DataFrame(((*idx, val) for idx, val in summary.items()), columns=[*df.columns, 'count'])
Which results in:
source target count
0 A B 2
1 C B 2
2 B D 1
for the following df
group participated
A 1
A 1
B 0
A 0
B 1
A 1
B 0
B 0
I want to count the total number of values in the participated column for each value in the group column (groupby-count) and then find a count of how many 1s there are in each group too
Something like
group tot_participated 1s
A 4 3
B 4 1
I know the first part is simple and can be done by a simple
grouped_df=df.groupby('group').count().reset_index()
unable to wrap my head around the second part. Any help will be greatly appreciated!
You could follow the groupby with an aggregation as below:
grp_df = df.groupby('group', as_index=False).agg({'participated':['count','sum']})
grp_df.columns = ['group','tot_participated','1s']
grp_df.head()
The caveat to using .agg with multiple aggregation functions on the same column is that a multi-column index is created. This can be remedied by resetting the column names as in line 2.
I have a dataframe with millions of rows with unique indexes and a column('b') that has several repeated values.
I would like to generate a dataframe without the duplicated data but I do not want to lose the index information. I want the new dataframe to have an index that is a concatenation of the indexes ("old_index1,old_index2") where 'b' had duplicated values but remains unchanged for rows where 'b' had unique values. The values of the 'b' column should remain unchanged like in a keep=first strategy. Example below.
Input dataframe:
df = pd.DataFrame(data = [[1,"non_duplicated_1"],
[2,"duplicated"],
[2,"duplicated"],
[3,"non_duplicated_2"],
[4,"non_duplicated_3"]],
index=['one','two','three','four','five'],
columns=['a','b'])
desired output:
a b
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
The actual dataframe is quite large so I would like to avoid non-vectorized operations.
I am finding this surprisingly difficult...Any ideas?
You can use transform on the index column (after you use reset_index). Then, drop duplicates in column b:
df.index = df.reset_index().groupby('b')['index'].transform(','.join)
df.drop_duplicates('b',inplace=True)
>>> df
a b
index
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
Setup
dct = {'index': ','.join, 'a': 'first'}
You can reset_index before using groupby, although it's unclear to me why you want this:
df.reset_index().groupby('b', as_index=False, sort=False).agg(dct).set_index('index')
b a
index
one non_duplicated_1 1
two,three duplicated 2
four non_duplicated_2 3
five non_duplicated_3 4
I want to add a column with incremental numbers for rows with the same value in a defined row;
e.g. if I would have this df
df=pd.DataFrame([['a','b'],['a','c'],['c','b']])
and I want incremental numbers for the first column. It should look like this
df=pd.DataFrame([['a','b',1],['a','c',2],['c','b',1]])
I found sql solutions but I'm working with ipython/pandas. Can someone help me?
Use cumcount, for name of new column use length of original columns:
print (len(df.columns))
2
df[len(df.columns)] = df.groupby(0).cumcount() + 1
print (df)
0 1 2
0 a b 1
1 a c 2
2 c b 1
I am trying to essentially do a COUNTIF in pandas to count how many items in a row match a number in the first column.
Dataframe:
a b c d
1 2 3 1
2 3 4 2
3 5 6 3
So I want to count instances in a row (b,c,d) that match a. In row 1 for instance it should be 1 as only d matches a.
I have searched quite a bit for this but so far only found examples where its a common number (like counting all values more than 0) but not based on a dataframe column. i'm guessing its some form of logic that masks based on the column but df == df.a doesnt seem to work
You can use eq, which you can pass an axis parameter to specify the direction of the comparison, then you can do a row sum to count the number of matched values:
df.eq(df.a, axis=0).sum(1) - 1
#0 1
#1 1
#2 1
#dtype: int64
df.apply(lambda x: (x == x[0]).sum()-1,axis=1)