Python pandas pivot_table: count frequency in one column

I am still new to pandas' pivot_table and would like to ask how to count the frequencies of values in one column, which is linked to another column of IDs. The DataFrame looks like the following.
import pandas as pd

df = pd.DataFrame({'Account_number': [1, 1, 2, 2, 2, 3, 3],
                   'Product': ['A', 'A', 'A', 'B', 'B', 'A', 'B']})
For the output, I'd like to get something like the following:
Product         A  B
Account_number
1               2  0
2               1  2
3               1  1
So far, I tried this code:
df.pivot_table(rows='Account_number', cols='Product', aggfunc='count')
This code gives me two identical copies of the counts. What is the problem with the code above? Part of the reason I am asking is that this DataFrame is just an example; the real data I am working on has tens of thousands of account numbers.

You need to specify the aggfunc as len:
In [11]: df.pivot_table(index='Account_number', columns='Product',
                        aggfunc=len, fill_value=0)
Out[11]:
Product         A  B
Account_number
1               2  0
2               1  2
3               1  1
It looks like count is counting the instances of each column (Account_number and Product); it's not clear to me whether this is a bug...

Solution: Use aggfunc='size'
Using aggfunc=len or aggfunc='count', as in the other answers on this page, will not work for DataFrames with more than three columns. By default, pandas applies the aggfunc to every column not named in the index or columns parameters.
For instance, if we had two more columns in our original DataFrame defined like this:
df = pd.DataFrame({'Account_number': [1, 1, 2, 2, 2, 3, 3],
                   'Product': ['A', 'A', 'A', 'B', 'B', 'A', 'B'],
                   'Price': [10] * 7,
                   'Quantity': [100] * 7})
Output:
   Account_number Product  Price  Quantity
0               1       A     10       100
1               1       A     10       100
2               2       A     10       100
3               2       B     10       100
4               2       B     10       100
5               3       A     10       100
6               3       B     10       100
If you apply the current solutions to this DataFrame, you would get the following:
df.pivot_table(index='Account_number',
               columns='Product',
               aggfunc=len,
               fill_value=0)
Output:
               Price    Quantity
Product            A  B        A  B
Account_number
1                  2  0        2  0
2                  1  2        1  2
3                  1  1        1  1
Solution
Instead, use aggfunc='size'. Since size always returns the same number for each column, pandas calls it only once instead of once per column.
df.pivot_table(index='Account_number',
               columns='Product',
               aggfunc='size',
               fill_value=0)
Output:
Product         A  B
Account_number
1               2  0
2               1  2
3               1  1
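As a cross-check (my addition, not part of the original answer), an equivalent plain groupby formulation produces the same table:
df.groupby(['Account_number', 'Product']).size().unstack(fill_value=0)

Product         A  B
Account_number
1               2  0
2               1  2
3               1  1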

In newer versions of pandas, a slight modification is required (index and columns instead of the old rows and cols arguments). I had to spend some time figuring this out, so I wanted to add it here so that someone can use it directly.
df.pivot_table(index='Account_number', columns='Product', aggfunc=len,
               fill_value=0)

You can use count:
df.pivot_table(index='Account_number', columns='Product', aggfunc='count')

I know this question is about pivot_table, but for the problem given in the question we can use crosstab:
out = pd.crosstab(df['Account_number'], df['Product'])
Output:
Product         A  B
Account_number
1               2  0
2               1  2
3               1  1
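If you want relative frequencies rather than raw counts, crosstab also accepts a normalize parameter (my addition, not part of the original answer):
pd.crosstab(df['Account_number'], df['Product'], normalize='index')

Product                A         B
Account_number
1               1.000000  0.000000
2               0.333333  0.666667
3               0.500000  0.500000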

Related

Why is the NaN not being included in the count when I do a groupby? Python "size" does NOT work as it doesn't give the count of NaNs [duplicate]

What is the difference between groupby("x").count and groupby("x").size in pandas?
Does size just exclude nil?
size includes NaN values, count does not:
In [46]:
import numpy as np
df = pd.DataFrame({'a': [0, 0, 1, 2, 2, 2],
                   'b': [1, 2, 3, 4, np.NaN, 4],
                   'c': np.random.randn(6)})
df
Out[46]:
   a    b         c
0  0    1  1.067627
1  0    2  0.554691
2  1    3  0.458084
3  2    4  0.426635
4  2  NaN -2.238091
5  2    4  1.256943
In [48]:
print(df.groupby(['a'])['b'].count())
print(df.groupby(['a'])['b'].size())

a
0    2
1    1
2    2
Name: b, dtype: int64
a
0    2
1    1
2    3
dtype: int64
What is the difference between size and count in pandas?
The other answers have pointed out the difference, however, it is not completely accurate to say "size counts NaNs while count does not". While size does indeed count NaNs, this is actually a consequence of the fact that size returns the size (or the length) of the object it is called on. Naturally, this also includes rows/values which are NaN.
So, to summarize, size returns the size of the Series/DataFrame¹,
df = pd.DataFrame({'A': ['x', 'y', np.nan, 'z']})
df

     A
0    x
1    y
2  NaN
3    z
df.A.size
# 4
...while count counts the non-NaN values:
df.A.count()
# 3
Notice that size is an attribute (it gives the same result as len(df) or len(df.A)), whereas count is a method.
¹ DataFrame.size is also an attribute and returns the number of elements in the DataFrame (rows × columns).
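For completeness (my addition, reusing the same single-column df from above), the DataFrame-level counterparts behave the same way:
df.size      # 4  -> 4 rows x 1 column, the NaN cell included
df.count()   # A    3
             # dtype: int64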
Behaviour with GroupBy - Output Structure
Besides the basic difference, there's also the difference in the structure of the generated output when calling GroupBy.size() vs GroupBy.count().
df = pd.DataFrame({
    'A': list('aaabbccc'),
    'B': ['x', 'x', np.nan, np.nan,
          np.nan, np.nan, 'x', 'x']
})
df

   A    B
0  a    x
1  a    x
2  a  NaN
3  b  NaN
4  b  NaN
5  c  NaN
6  c    x
7  c    x
Consider,
df.groupby('A').size()

A
a    3
b    2
c    3
dtype: int64
Versus,
df.groupby('A').count()

   B
A
a  2
b  0
c  2
GroupBy.count returns a DataFrame when you call count on all columns, while GroupBy.size returns a Series. The reason is that size is the same for every column, so only a single result is returned, whereas count is called for each column because the result depends on how many NaNs that column has.
Behavior with pivot_table
Another example is how pivot_table treats this data. Suppose we would like to compute the cross tabulation of
df

   A  B
0  0  1
1  0  1
2  1  2
3  0  2
4  0  0
pd.crosstab(df.A, df.B)  # Result we expect, but with `pivot_table`.

B  0  1  2
A
0  1  2  1
1  0  0  1
With pivot_table, you can pass aggfunc='size':
df.pivot_table(index='A', columns='B', aggfunc='size', fill_value=0)

B  0  1  2
A
0  1  2  1
1  0  0  1
But count does not work; an empty DataFrame is returned:
df.pivot_table(index='A', columns='B', aggfunc='count')
Empty DataFrame
Columns: []
Index: [0, 1]
I believe the reason for this is that 'count' must be done on the series that is passed to the values argument, and when nothing is passed, pandas decides to make no assumptions.
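To illustrate that point (my own sketch, not part of the original answer; the helper column n is hypothetical), count does work as soon as there is an explicit values column to count:
df.assign(n=1).pivot_table(index='A', columns='B', values='n',
                           aggfunc='count', fill_value=0)

B  0  1  2
A
0  1  2  1
1  0  0  1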
Just to add a little bit to @EdChum's answer: even if the data has no NA values, the result of count() is more verbose. Using the earlier example:
grouped = df.groupby('a')
grouped.count()
Out[197]:
   b  c
a
0  2  2
1  1  1
2  2  3

grouped.size()
Out[198]:
a
0    2
1    1
2    3
dtype: int64
When we are dealing with normal DataFrames, the only difference is the treatment of NaN values: count skips NaN values while counting rows, whereas size does not. But when we use these functions with groupby, to get the correct row counts from count() we have to pair the groupby with a column that contains no missing values, whereas for size() no such column is needed.
In addition to all the above answers, I would like to point out one more difference which I find significant.
You can loosely compare pandas' DataFrame size and count to a container's capacity versus the number of elements actually stored in it. The size attribute gives the total number of cells in the DataFrame (rows × columns), including cells that hold NaN, whereas count gives the number of values that are actually present (non-NaN). For example,
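(the original post does not show its example DataFrame, so the following is a reconstructed sketch)
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, np.nan], 'y': [4, 5, 6]})
df.size      # 6  -> 3 rows x 2 columns, the NaN cell included
df.count()   # x    2
             # y    3
             # dtype: int64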
You can see that even though there are only 3 rows in the DataFrame, its size is 6.
This answer covers the size/count difference for DataFrames rather than pandas Series; I have not checked what happens with Series.

Python DataFrame count how many different elements

I need to count how many different elements are in my DataFrame (df).
My df has the day of the month (as a number: 1, 2, 3 ... 31) on which a certain variable was measured. There are 3 columns that describe the day number. There are multiple measurements in one day, so my columns have repeated values. I need to know on how many days in a month that variable was measured, ignoring how many times per day the measurement was done. So I was thinking of counting the days while ignoring repeated values.
As an example the data of my df would look like this:
col1 col2 col3
2 2 2
2 2 3
3 3 3
3 4 8
I need an output that tells me that in that DataFrame the numbers are 2, 3, 4 and 8.
Thanks!
Just do:
df=pd.DataFrame({"col1": [2,2,3,3], "col2": [2,2,3,4], "col3": [2,3,3,8]})
df.stack().unique()
Outputs:
[2 3 4 8]
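Since the question ultimately asks how many different days there are, you can wrap this in len() (my addition):
len(df.stack().unique())
# 4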
You can use the drop_duplicates function on your dataframe, like:
import pandas as pd

df = pd.DataFrame({'a': [2, 2, 3], 'b': [2, 2, 3], 'c': [2, 2, 3]})

   a  b  c
0  2  2  2
1  2  2  2
2  3  3  3

df = df.drop_duplicates()
print(df['a'].count())
# out: 2
Or you can use numpy to get the unique values in the dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X' : [2, 2, 3, 3], 'Y' : [2,2,3,4], 'Z' : [2,3,3,8]})
df_unique = np.unique(np.array(df))
print(df_unique)
#Output [2 3 4 8]
#for the count of days:
print(len(df_unique))
#Output 4
How about:
Assuming this is your initial df:
col1 col2 col3
0 2 2 2
1 2 2 2
2 3 3 3
Then:
count_df = pd.DataFrame()
for i in df.columns:
    df2 = df[i].value_counts()
    count_df = pd.concat([count_df, df2], axis=1)
final_df = count_df.sum(axis=1)
final_df = pd.DataFrame(data=final_df, columns=['Occurrences'])
print(final_df)

   Occurrences
2            6
3            3
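The same occurrence table can also be obtained in one line (my addition, not part of the original answer):
df.stack().value_counts()
# 2    6
# 3    3
# dtype: int64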
You can use pandas.unique() like so:
pd.unique(df.to_numpy().flatten())
I have done some basic benchmarking; this method appears to be the fastest.

How to count data in a column based on another column separately?

I have two dataframe like this:
df1 = pd.DataFrame({'a':[1,2]})
df2 = pd.DataFrame({'a':[1,1,1,2,2,3,4,5,6,7,8]})
I want to count how many times each number from df1 appears in df2. The correct answer looks like:
No Amount
1 3
2 2
Instead of:
No Amount
1 5
2 5
How can I solve this problem?
First filter df2 for values that are contained in df1['a'], then apply value_counts. The rest of the code just presents the data in your desired format.
result = (
    df2[df2['a'].isin(df1['a'].unique())]['a']
    .value_counts()
    .reset_index()
)
result.columns = ['No', 'Amount']
>>> result
   No  Amount
0   1       3
1   2       2
In pandas 0.21.0 you can use set_axis to rename columns as a chained method. Here's a one-line solution:
df2[df2.a.isin(df1.a)]\
   .squeeze()\
   .value_counts()\
   .reset_index()\
   .set_axis(['No', 'Amount'], axis=1, inplace=False)
Output:
   No  Amount
0   1       3
1   2       2
You can simply take the value_counts of the second df and map it onto the first df, i.e.
df1['Amount'] = df1['a'].map(df2['a'].value_counts())
df1 = df1.rename(columns={'a':'No'})
Output:
No Amount
0 1 3
1 2 2
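One small caveat (my addition, not part of the original answer): if df1 contains values that never appear in df2, map will leave NaN in Amount; you can fill those with 0 before the rename:
df1['Amount'] = df1['a'].map(df2['a'].value_counts()).fillna(0).astype(int)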

Pandas how to aggregate more than one column

Here is the snippet:
test = pd.DataFrame({'userid': [1,1,1,2,2], 'order_id': [1,2,3,4,5], 'fee': [2,1,5,3,1]})
I'd like to group based on userid, count the 'order_id' column, and sum the 'fee' column:
test.groupby('userid').order_id.count()
test.groupby('userid').fee.sum()
Is it possible to perform these two operations in one line of code so that I get a resulting df that looks like this:
userid counts sum
...
I've tried pivot_table:
test.pivot_table(index='userid', values=['order_id', 'fee'], aggfunc=[np.size, np.sum])
It gives something like this:
        size           sum
         fee order_id  fee order_id
userid
1          3        3    8        6
2          2        2    4        9
Is it possible to tell pandas to use np.size & np.sum on one column but not both?
Use DataFrameGroupBy.agg and rename the columns:
d = {'order_id': 'counts', 'fee': 'sum'}
df = (test.groupby('userid')
          .agg({'order_id': 'count', 'fee': 'sum'})
          .rename(columns=d)
          .reset_index())
print(df)
   userid  sum  counts
0       1    8       3
1       2    4       2
But it is better to aggregate by size, because count is only needed if you want to exclude NaNs:
df = (test.groupby('userid')
          .agg({'order_id': 'size', 'fee': 'sum'})
          .rename(columns=d)
          .reset_index())
print(df)
   userid  sum  counts
0       1    8       3
1       2    4       2
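In pandas 0.25+ you can also use named aggregation, which avoids the separate rename step (my addition, not part of the original answer):
test.groupby('userid').agg(counts=('order_id', 'size'),
                           sum=('fee', 'sum')).reset_index()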

Pandas: merge multiple dataframes and control column names?

I would like to merge nine Pandas dataframes together into a single dataframe, doing a join on two columns, controlling the column names. Is this possible?
I have nine datasets. All of them have the following columns:
org, name, items, spend
I want to join them into a single dataframe with the following columns:
org, name, items_df1, spend_df1, items_df2, spend_df2, items_df3...
I've been reading the documentation on merging and joining. I can currently merge two datasets together like this:
ad = pd.DataFrame.merge(df_presents, df_trees,
                        on=['practice', 'name'],
                        suffixes=['_presents', '_trees'])
This works great; doing print list(aggregate_data.columns.values) shows me the following columns:
[u'org', u'name', u'spend_presents', u'items_presents', u'spend_trees', u'items_trees'...]
But how can I do this for nine columns? merge only seems to accept two at a time, and if I do it sequentially, my column names are going to end up very messy.
You could use functools.reduce to iteratively apply pd.merge to each of the DataFrames:
result = functools.reduce(merge, dfs)
This is equivalent to
result = dfs[0]
for df in dfs[1:]:
result = merge(result, df)
To pass the on=['org', 'name'] argument, you could use functools.partial to define the merge function:
merge = functools.partial(pd.merge, on=['org', 'name'])
Since specifying the suffixes parameter in functools.partial would only allow one fixed choice of suffix, and since here we need a different suffix for each pd.merge call, I think it would be easiest to prepare the DataFrames' column names before calling pd.merge:
for i, df in enumerate(dfs, start=1):
    df.rename(columns={col: '{}_df{}'.format(col, i) for col in ('items', 'spend')},
              inplace=True)
For example,
import pandas as pd
import numpy as np
import functools

np.random.seed(2015)

N = 50
dfs = [pd.DataFrame(np.random.randint(5, size=(N, 4)),
                    columns=['org', 'name', 'items', 'spend']) for i in range(9)]

for i, df in enumerate(dfs, start=1):
    df.rename(columns={col: '{}_df{}'.format(col, i) for col in ('items', 'spend')},
              inplace=True)

merge = functools.partial(pd.merge, on=['org', 'name'])
result = functools.reduce(merge, dfs)
print(result.head())
yields
org name items_df1 spend_df1 items_df2 spend_df2 items_df3 \
0 2 4 4 2 3 0 1
1 2 4 4 2 3 0 1
2 2 4 4 2 3 0 1
3 2 4 4 2 3 0 1
4 2 4 4 2 3 0 1
spend_df3 items_df4 spend_df4 items_df5 spend_df5 items_df6 \
0 3 1 0 1 0 4
1 3 1 0 1 0 4
2 3 1 0 1 0 4
3 3 1 0 1 0 4
4 3 1 0 1 0 4
spend_df6 items_df7 spend_df7 items_df8 spend_df8 items_df9 spend_df9
0 3 4 1 3 0 1 2
1 3 4 1 3 0 0 3
2 3 4 1 3 0 0 0
3 3 3 1 3 0 1 2
4 3 3 1 3 0 0 3
Would doing a big pd.concat() and then renaming all the columns work for you? Something like:
desired_columns = ['items', 'spend']
big_df = pd.concat([df1, df2[desired_columns], ..., dfN[desired_columns]], axis=1)

new_columns = ['org', 'name']
for i in range(num_dataframes):
    new_columns.extend(['spend_df%i' % i, 'items_df%i' % i])
big_df.columns = new_columns
This should give you columns like:
org, name, spend_df0, items_df0, spend_df1, items_df1, ..., spend_df8, items_df8
I've wanted this as well at times but been unable to find a built-in pandas way of doing it. Here is my suggestion (and my plan for the next time I need it):
Create an empty dictionary, merge_dict.
Loop through the index you want for each of your data frames and add the desired values to the dictionary with the index as the key.
Generate a new index as sorted(merge_dict).
Generate a new list of data for each column by looping through merge_dict.items().
Create a new data frame with index=sorted(merge_dict) and columns created in the previous step.
Basically, this is somewhat like a hash join in SQL. It seems like the most efficient approach I can think of and shouldn't take too long to code up; a rough sketch is below.
Good luck.
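A minimal sketch of the steps above (my own code, not from the original answer; it assumes every frame shares the same ('org', 'name') key columns, and merge_dict is the only name taken from the answer):
import pandas as pd

def dict_merge(dfs, keys=('org', 'name')):
    # Step 1: empty dictionary keyed by the join index
    merge_dict = {}
    # Step 2: loop through each frame and store its values under that key
    for i, df in enumerate(dfs, start=1):
        for _, row in df.iterrows():
            idx = tuple(row[k] for k in keys)
            entry = merge_dict.setdefault(idx, {})
            entry['items_df{}'.format(i)] = row['items']
            entry['spend_df{}'.format(i)] = row['spend']
    # Steps 3-5: sorted index, one value per column, new DataFrame
    index = sorted(merge_dict)
    result = pd.DataFrame([merge_dict[idx] for idx in index],
                          index=pd.MultiIndex.from_tuples(index, names=list(keys)))
    return result.reset_index()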
