Summing values in a group of a groupby object? - python

I am trying to sum the values of a column in a groupby object, once for each of the groups I grouped by.
Say I had a df like this:
Letters Numbers Items Bool
A 1 lamp 1
B 2 glass 1
B 2 table 1
C 5 pic 0
I group by Letters and then want to know the sum of the Bool values within each Letters group. How would I do this? I've been trying
df_new = df.groupby('letters').bool.sum()
...
df_new = df.groupby('letters').sum('bool')
and other variations...
In the end I would like a vector that contains one value for the sum of each Letters group. For the example above, it would be [1, 2, 0].

You were really close! Given
>>> df
Letters Numbers Items Bool
0 A 1 lamp 1
1 B 2 glass 1
2 B 2 table 1
3 C 5 pic 0
You could sum everything and take the column you want:
>>> # slower
>>> df.groupby("Letters").sum()["Bool"] # sum everything, select Bool
Letters
A 1
B 2
C 0
Name: Bool, dtype: int64
Or better, take only the column you want and sum it:
>>> df.groupby("Letters")["Bool"].sum() # select Bool, sum it
Letters
A 1
B 2
C 0
Name: Bool, dtype: int64
I prefer to stick with the Series, because you can do more with it, but you can convert this to a list using list or .tolist() if you prefer.
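For example, a quick sketch (using the df above) of getting the plain list asked for in the question:
>>> df.groupby("Letters")["Bool"].sum().tolist()
[1, 2, 0]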

Related

Why is the NaN not being included in the count when I do a groupby? "size" does not work as it doesn't give the count of NaNs [duplicate]

What is the difference between groupby("x").count() and groupby("x").size() in pandas?
Does size just exclude NaN?
size includes NaN values, count does not:
In [46]:
df = pd.DataFrame({'a':[0,0,1,2,2,2], 'b':[1,2,3,4,np.NaN,4], 'c':np.random.randn(6)})
df
Out[46]:
a b c
0 0 1 1.067627
1 0 2 0.554691
2 1 3 0.458084
3 2 4 0.426635
4 2 NaN -2.238091
5 2 4 1.256943
In [48]:
print(df.groupby(['a'])['b'].count())
print(df.groupby(['a'])['b'].size())
a
0 2
1 1
2 2
Name: b, dtype: int64
a
0 2
1 1
2 3
dtype: int64
What is the difference between size and count in pandas?
The other answers have pointed out the difference, however, it is not completely accurate to say "size counts NaNs while count does not". While size does indeed count NaNs, this is actually a consequence of the fact that size returns the size (or the length) of the object it is called on. Naturally, this also includes rows/values which are NaN.
So, to summarize, size returns the size of the Series/DataFrame [1],
df = pd.DataFrame({'A': ['x', 'y', np.nan, 'z']})
df
A
0 x
1 y
2 NaN
3 z
df.A.size
# 4
...while count counts the non-NaN values:
df.A.count()
# 3
Notice that size is an attribute (it gives the same result as len(df) or len(df.A)), while count is a method.
1. DataFrame.size is also an attribute and returns the number of elements in the DataFrame (rows x columns).
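To illustrate the footnote: with the same single-column df, DataFrame.size counts all cells while DataFrame.count() counts per column:
df.size
# 4    (4 rows x 1 column)
df.count()
# A    3
# dtype: int64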
Behaviour with GroupBy - Output Structure
Besides the basic difference, there's also the difference in the structure of the generated output when calling GroupBy.size() vs GroupBy.count().
df = pd.DataFrame({
'A': list('aaabbccc'),
'B': ['x', 'x', np.nan, np.nan,
np.nan, np.nan, 'x', 'x']
})
df
A B
0 a x
1 a x
2 a NaN
3 b NaN
4 b NaN
5 c NaN
6 c x
7 c x
Consider,
df.groupby('A').size()
A
a 3
b 2
c 3
dtype: int64
Versus,
df.groupby('A').count()
B
A
a 2
b 0
c 2
GroupBy.count returns a DataFrame when you call count on all columns, while GroupBy.size returns a Series.
The reason is that size is the same for all columns, so only a single result is returned. Meanwhile, count is computed for each column separately, as the result depends on how many NaNs each column has.
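If you select a single column before counting, you get a Series back, just like with size. A quick sketch with the same df:
df.groupby('A')['B'].count()
A
a    2
b    0
c    2
Name: B, dtype: int64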
Behavior with pivot_table
Another example is how pivot_table treats this data. Suppose we would like to compute the cross tabulation of
df
A B
0 0 1
1 0 1
2 1 2
3 0 2
4 0 0
pd.crosstab(df.A, df.B) # Result we expect, but with `pivot_table`.
B 0 1 2
A
0 1 2 1
1 0 0 1
With pivot_table, you can pass size as the aggregation function:
df.pivot_table(index='A', columns='B', aggfunc='size', fill_value=0)
B 0 1 2
A
0 1 2 1
1 0 0 1
But count does not work; an empty DataFrame is returned:
df.pivot_table(index='A', columns='B', aggfunc='count')
Empty DataFrame
Columns: []
Index: [0, 1]
I believe the reason for this is that 'count' must be done on the series that is passed to the values argument, and when nothing is passed, pandas decides to make no assumptions.
Just to add a little to EdChum's answer: even if the data has no NA values, the result of count() is more verbose. Using the example from before:
grouped = df.groupby('a')
grouped.count()
Out[197]:
b c
a
0 2 2
1 1 1
2 2 3
grouped.size()
Out[198]:
a
0 2
1 1
2 3
dtype: int64
When we are dealing with normal DataFrames, the only difference is the inclusion of NaN values: count does not include NaN values when counting rows.
But when using these functions with groupby, to get the size of each group from count() we have to apply it to a column that has no missing values, whereas size() needs no such care.
In addition to all above answers, I would like to point out one more difference which I find significant.
You can loosely compare pandas' DataFrame size and count with a Java Vector's capacity and size: a container can have more slots than it has actual elements. Similarly, a DataFrame can have more cells than non-null values.
The size attribute gives the total number of cells in the DataFrame (rows × columns), whereas count gives the number of values actually present (non-NaN). For example,
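the example referred to here is not included in the text; a minimal sketch with made-up column names x and y that matches the description (3 rows, size 6):
df = pd.DataFrame({'x': [1, 2, np.nan], 'y': [4, np.nan, np.nan]})
df.size     # 6  (3 rows x 2 columns)
df.count()  # x    2
            # y    1
            # dtype: int64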
You can see that even though there are 3 rows in DataFrame, its size is 6.
This answer covers size and count difference with respect to DataFrame and not pandas Series. I have not checked what happens with Series.

Column-wise string counts for multiple columns in a pandas DataFrame

I have a dataframe as below:
Name Marks Place Points
John-->Hile 50 Germany-->Poland 1
Rog-->Oliver 60-->70 Australia-->US 2
Harry 80 UK 3
Faye-->George 90 Poland 4
I want a result, as below, which counts the values containing "-->" column-wise and transposes the counts into the following dataframe:
Column Count
Name 3
Marks 1
Place 1
This df is just an example. The dataframe is dynamic and can vary in each run; in the 2nd run we might have Name, Marks, Place, or only Name, Marks, or anything else, so the code should be dynamic enough to run on any df.
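For reference, a reconstruction of the sample data as a DataFrame (an assumption: Marks is stored as strings, since it mixes plain numbers with '60-->70'):
df = pd.DataFrame({
    'Name': ['John-->Hile', 'Rog-->Oliver', 'Harry', 'Faye-->George'],
    'Marks': ['50', '60-->70', '80', '90'],
    'Place': ['Germany-->Poland', 'Australia-->US', 'UK', 'Poland'],
    'Points': [1, 2, 3, 4]
})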
You can select the object columns and perform a column-wise count and summation:
df.select_dtypes(object).apply(lambda x: x.str.contains('-->')).sum()
Name 3
Marks 1
Place 2
dtype: int64
Another weird, but interesting method with applymap:
(df.select_dtypes(object)
.applymap(lambda x: '-->' in x if isinstance(x, str) else False)
.sum())
Name 3
Marks 1
Place 2
dtype: int64
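Note that in recent pandas versions (2.1+), applymap is deprecated in favour of DataFrame.map; a rough equivalent would be:
(df.select_dtypes(object)
 .map(lambda x: '-->' in x if isinstance(x, str) else False)
 .sum())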

How to count the number of occurrences that a certain value occurs in a DataFrame according to another column?

I have a Pandas DataFrame that has two columns as such:
item1 label
0 a 0
1 a 1
2 b 0
3 c 0
4 a 1
5 a 0
6 b 0
In sum, there are a total of three kinds of items in the column item1, namely a, b, and c. The values in the label column are either 0 or 1.
What I want to do is receive a DataFrame where I have a count of how many entries in item1 have label value 1. Using the toy example above, the desired DataFrame would be something like:
item1 label
0 a 2
1 b 0
2 c 0
How might I achieve something like that?
I've tried using the following line of code:
df[['item1', 'label']].groupby('item1').sum()['label']
but the result is a Pandas Series and also displays some behaviors and properties that aren't desired.
IIUC, you can use pd.crosstab:
count_1=pd.crosstab(df['item1'],df['label'])[1]
print(count_1)
item1
a 2
b 0
c 0
Name: 1, dtype: int64
To get a DataFrame:
count_1=pd.crosstab(df['item1'],df['label'])[1].rename('label').reset_index()
print(count_1)
item1 label
0 a 2
1 b 0
2 c 0
The good thing about this method is that it also lets you get the count of 0s easily, which you don't get if you use the sum.
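For example, dropping the [1] selection gives both counts at once (a sketch using the same df):
pd.crosstab(df['item1'], df['label'])
label  0  1
item1
a      2  2
b      2  0
c      1  0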
Filtering columns before the groupby is not necessary; you can specify the column after the groupby for the sum aggregation. To get a 2-column DataFrame, add the as_index=False parameter:
df = df.groupby('item1', as_index=False)['label'].sum()
Alternative is use Series.reset_index:
df = df.groupby('item1')['label'].sum().reset_index()
print (df)
item1 label
0 a 2
1 b 0
2 c 0
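If you also want control over the resulting column name, a sketch using named aggregation (available since pandas 0.25; label_sum is just an illustrative name):
df.groupby('item1', as_index=False).agg(label_sum=('label', 'sum'))
  item1  label_sum
0     a          2
1     b          0
2     c          0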

How do I delete columns where the average of the column already exists

In the example below, Column C should be deleted because it already exists (Column A should remain)
type(df): pandas.core.frame.DataFrame
A B C
1 2 1
0 2 0
3 2 3
I tried creating a dictionary to later delete repeated values but got stuck
dict_test = {}
for each_column in df:
dict_test[each_column] = df[[each_column]].mean()
dict_test
The result came out with values like 'A': A    1.33333, dtype: float64.
The problem is that each dictionary value is a whole Series (index plus value) rather than a plain number, so I can't compare the values to one another.
You can use df.mean().drop_duplicates() and pandas indexing:
In [30]: df[df.mean().drop_duplicates().index]
Out[30]:
A B
0 1 2
1 0 2
2 3 2
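If you want to stay closer to the dictionary idea from the question, a sketch that stores scalar means (df[col].mean() rather than df[[col]].mean()) and keeps only the first column seen for each mean; like the answer above, it relies on exact float comparison:
seen = {}
keep = []
for col in df:
    m = df[col].mean()            # a scalar, not a Series
    if m not in seen.values():    # drop columns whose mean already appeared
        seen[col] = m
        keep.append(col)
df = df[keep]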

Pandas COUNTIF based on column value

I am trying to essentially do a COUNTIF in pandas to count how many items in a row match a number in the first column.
Dataframe:
a b c d
1 2 3 1
2 3 4 2
3 5 6 3
So I want to count instances in a row (b,c,d) that match a. In row 1 for instance it should be 1 as only d matches a.
I have searched quite a bit for this, but so far only found examples where it's a fixed number (like counting all values greater than 0), not based on a dataframe column. I'm guessing it's some form of logic that masks based on the column, but df == df.a doesn't seem to work.
You can use eq, to which you can pass an axis parameter to specify the direction of the comparison; then do a row sum to count the number of matched values. Subtract 1 because column a always matches itself:
df.eq(df.a, axis=0).sum(1) - 1
#0 1
#1 1
#2 1
#dtype: int64
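If you'd rather avoid the - 1 correction, a sketch that compares only the b, c and d columns against a:
df[['b', 'c', 'd']].eq(df['a'], axis=0).sum(axis=1)
#0    1
#1    1
#2    1
#dtype: int64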
An alternative, applying the comparison row by row:
df.apply(lambda x: (x == x.iloc[0]).sum() - 1, axis=1)
