Get counts of counts of unique values in pandas dataframe - python

I'm trying to get counts of the counts of unique values for a column in a pandas DataFrame.
Sample data below:
In [3]: df = pd.DataFrame([[1, 1], [2, 1], [3, 2], [4, 3], [5, 1]], columns=['AppointmentId', 'PatientId'])
In [4]: df
Out[4]:
   AppointmentId  PatientId
0              1          1
1              2          1
2              3          2
3              4          3
4              5          1
The actual dataset has over 50,000 unique values of PatientId. I want to visualize the appointment count per patient, but simply grouping by PatientId and taking group sizes doesn't work well for plotting, because that would be 50,000 bars.
For that reason I'm trying to plot how many patients had a specific number of appointments, instead of plotting the number of appointments against PatientId.
Based on the sample data above I want to get something like this:
   AppointmentCount  PatientCount
0                 1             2
1                 3             1
I approached this by first grouping on PatientId and getting the group sizes, but I can't find a way to then count how many patients share each size.
In [24]: appointment_counts = df.groupby('PatientId').size()
In [25]: appointment_counts
Out[25]:
PatientId
1    3
2    1
3    1
dtype: int64
In [26]: type(appointment_counts)
Out[26]: pandas.core.series.Series

After your groupby, add value_counts:
df.groupby('PatientId').size().value_counts()
Out[877]:
1    2
3    1
dtype: int64
Then you can reset the index and rename the columns:
df.groupby('PatientId').size().value_counts().reset_index().rename(columns={'index':'Aid',0:'Pid'})
Out[883]:
   Aid  Pid
0    1    2
1    3    1
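If the end goal is the plot from the question, a minimal sketch (assuming matplotlib is installed; df is the sample frame from above):
import matplotlib.pyplot as plt

# how many patients share each appointment count
counts = df.groupby('PatientId').size().value_counts().sort_index()
counts.plot(kind='bar')
plt.xlabel('AppointmentCount')
plt.ylabel('PatientCount')
plt.show()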


Why is the NaN not being included in the count when I do a groupby? Python "size" does NOT work as it doesn't give the count of NaNs [duplicate]

What is the difference between groupby("x").count and groupby("x").size in pandas?
Does size just exclude nil?
size includes NaN values, count does not:
In [46]:
df = pd.DataFrame({'a':[0,0,1,2,2,2], 'b':[1,2,3,4,np.NaN,4], 'c':np.random.randn(6)})
df
Out[46]:
   a    b         c
0  0    1  1.067627
1  0    2  0.554691
2  1    3  0.458084
3  2    4  0.426635
4  2  NaN -2.238091
5  2    4  1.256943
In [48]:
print(df.groupby(['a'])['b'].count())
print(df.groupby(['a'])['b'].size())
a
0    2
1    1
2    2
Name: b, dtype: int64
a
0    2
1    1
2    3
dtype: int64
What is the difference between size and count in pandas?
The other answers have pointed out the difference; however, it is not completely accurate to say "size counts NaNs while count does not". While size does indeed count NaNs, this is actually a consequence of the fact that size returns the size (or the length) of the object it is called on. Naturally, this also includes rows/values which are NaN.
So, to summarize, size returns the size of the Series/DataFrame [1],
df = pd.DataFrame({'A': ['x', 'y', np.nan, 'z']})
df
     A
0    x
1    y
2  NaN
3    z
df.A.size
# 4
...while count counts the non-NaN values:
df.A.count()
# 3
Notice that size is an attribute (it gives the same result as len(df) or len(df.A)), while count is a method.
[1] DataFrame.size is also an attribute and returns the number of elements in the DataFrame (rows x columns).
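A quick sketch illustrating the footnote (df2 is a hypothetical two-column frame of my own):
df2 = pd.DataFrame({'A': ['x', 'y', np.nan, 'z'], 'B': [1, 2, 3, 4]})
df2.size
# 8 -> 4 rows x 2 columns, NaN cells included
df2.count()
# A    3
# B    4
# dtype: int64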
Behaviour with GroupBy - Output Structure
Besides the basic difference, there's also the difference in the structure of the generated output when calling GroupBy.size() vs GroupBy.count().
df = pd.DataFrame({
    'A': list('aaabbccc'),
    'B': ['x', 'x', np.nan, np.nan,
          np.nan, np.nan, 'x', 'x']
})
df
   A    B
0  a    x
1  a    x
2  a  NaN
3  b  NaN
4  b  NaN
5  c  NaN
6  c    x
7  c    x
Consider,
df.groupby('A').size()
A
a    3
b    2
c    3
dtype: int64
Versus,
df.groupby('A').count()
   B
A
a  2
b  0
c  2
GroupBy.count returns a DataFrame when you call count on all columns, while GroupBy.size returns a Series.
The reason is that size is the same for all columns, so only a single result is returned. Meanwhile, count is called for each column, as the results depend on how many NaNs each column has.
Behavior with pivot_table
Another example is how pivot_table treats this data. Suppose we would like to compute the cross tabulation of
df
A B
0 0 1
1 0 1
2 1 2
3 0 2
4 0 0
pd.crosstab(df.A, df.B)  # the result we want, but computed with `pivot_table`
B  0  1  2
A
0  1  2  1
1  0  0  1
With pivot_table, you can pass aggfunc='size':
df.pivot_table(index='A', columns='B', aggfunc='size', fill_value=0)
B  0  1  2
A
0  1  2  1
1  0  0  1
But count does not work; an empty DataFrame is returned:
df.pivot_table(index='A', columns='B', aggfunc='count')
Empty DataFrame
Columns: []
Index: [0, 1]
I believe the reason for this is that 'count' must be done on the series that is passed to the values argument, and when nothing is passed, pandas decides to make no assumptions.
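One possible workaround (my own sketch, not from the original answer) is to give pivot_table an explicit values column to count; the helper column _n below is hypothetical:
df.assign(_n=1).pivot_table(index='A', columns='B', values='_n',
                            aggfunc='count', fill_value=0)
# B  0  1  2
# A
# 0  1  2  1
# 1  0  0  1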
Just to add a little to @EdChum's answer: even if the data has no NA values, the result of count() is more verbose. Using the example from before:
grouped = df.groupby('a')
grouped.count()
Out[197]:
   b  c
a
0  2  2
1  1  1
2  2  3
grouped.size()
Out[198]:
a
0    2
1    1
2    3
dtype: int64
When dealing with normal DataFrames, the only difference is the inclusion of NaN values: count does not include NaN values when counting rows.
But when using these functions with groupby, to get correct results from count() you have to associate a specific column with the groupby, whereas size() needs no such association to return the number of rows per group.
In addition to all the above answers, I would like to point out one more difference which I find significant.
You can loosely compare pandas' DataFrame size and count with a Java Vector's capacity and size. When a vector is created, some predefined amount of memory is allocated to it; as it approaches the maximum number of elements it can hold, more memory is allocated to accommodate further additions. Similarly, a DataFrame's allocated memory grows as elements are added.
The size attribute gives the number of cells allocated to the DataFrame, whereas count gives the number of elements actually present. For example:
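The code example appears to have been lost from this answer; a minimal reconstruction consistent with the described result might look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [4, np.nan, np.nan]})
df.size
# 6 -> 3 rows x 2 columns, NaN cells included
df.count()
# a    2
# b    1
# dtype: int64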
You can see that even though there are 3 rows in the DataFrame, its size is 6.
This answer covers the size/count difference with respect to DataFrames, not pandas Series; I have not checked what happens with a Series.

Python DataFrame count how many different elements

I need to count how many different elements are in my DataFrame (df).
My df has the day of the month (as a number: 1, 2, 3 ... 31) on which a certain variable was measured. There are 3 columns that describe the day number. There are multiple measurements per day, so my columns have repeated values. I need to know on how many days in a month the variable was measured, ignoring how many measurements were done on a given day. So I was thinking of counting the days while ignoring repeated values.
As an example the data of my df would look like this:
col1  col2  col3
2     2     2
2     2     3
3     3     3
3     4     8
I need an output that tells me that in that DataFrame the numbers are 2, 3, 4 and 8.
Thanks!
Just do:
df=pd.DataFrame({"col1": [2,2,3,3], "col2": [2,2,3,4], "col3": [2,3,3,8]})
df.stack().unique()
Outputs:
[2 3 4 8]
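Since the question ultimately asks how many days there are, a short follow-up (my addition):
len(df.stack().unique())
# 4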
You can use drop_duplicates on your DataFrame, like:
import pandas as pd
df = pd.DataFrame({'a':[2,2,3], 'b':[2,2,3], 'c':[2,2,3]})
   a  b  c
0  2  2  2
1  2  2  2
2  3  3  3
df = df.drop_duplicates()
print(df['a'].count())
out: 2
Or you can use numpy to get the unique values in the dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X' : [2, 2, 3, 3], 'Y' : [2,2,3,4], 'Z' : [2,3,3,8]})
df_unique = np.unique(np.array(df))
print(df_unique)
#Output [2 3 4 8]
#for the count of days:
print(len(df_unique))
#Output 4
How about:
Assuming this is your initial df:
   col1  col2  col3
0     2     2     2
1     2     2     2
2     3     3     3
Then:
count_df = pd.DataFrame()
for i in df.columns:
    df2 = df[i].value_counts()
    count_df = pd.concat([count_df, df2], axis=1)
final_df = count_df.sum(axis=1)
final_df = pd.DataFrame(data=final_df, columns=['Occurrences'])
print(final_df)
   Occurrences
2            6
3            3
You can use pandas.unique() like so:
pd.unique(df.to_numpy().flatten())
In some basic benchmarking, this method appears to be the fastest.

Summing columns according to pattern in column names

Let's start with a very simplified abstract example; I have a dataframe like this:
import pandas as pd
d = {'1-A': [1, 2], '1-B': [3, 4], '2-A': [3, 4], '5-B': [2, 7]}
df = pd.DataFrame(data=d)
   1-A  1-B  2-A  5-B
0    1    3    3    2
1    2    4    4    7
I'm looking for an elegant, pandastic solution that produces a dataframe like this:
   1  2  5
0  4  3  2
1  6  4  7
To make the example more concrete: column 1-A means person id=1, expense category A. Rows are expenses for each month. As a result, I want monthly expenses per person across categories (so column 1 is the sum of columns 1-A and 1-B). Note that when there are no expenses, there is no column of 0s. Of course, it should be ready for more columns (ids and categories).
I'm quite sure a smart solution with a good separation of column selection and the summing operation exists for this.
Use groupby with a lambda function that splits each column name on '-' and selects the first piece; for grouping by columns, add axis=1:
df1 = df.groupby(lambda x: x.split('-')[0], axis=1).sum()
#alternative
#df1 = df.groupby(df.columns.str.split('-').str[0], axis=1).sum()
print (df1)
   1  2  5
0  4  3  2
1  6  4  7
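Note that grouping with axis=1 is deprecated in recent pandas releases; a sketch of an equivalent transpose-based approach, under the same column-naming assumption:
df1 = df.T.groupby(df.columns.str.split('-').str[0]).sum().T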

Pandas Multiple Column Division

I am trying to divide column 0 by columns 1 and 2. From the below, I would like to get back a DataFrame of 10 rows and 3 columns, where the first column is all 1's. Instead I get a 10x10 DataFrame. What am I doing wrong?
data = np.random.randn(10,3)
df = pd.DataFrame(data)
df[0] / df
First, create a 10-by-3 DataFrame with all columns equal to the first column, and then divide it by your DataFrame.
df[[0, 0, 0]] / df.values
or
df[[0, 0, 0]].values / df
...if you want to keep the column names.
(I use .values to avoid reindexing which will fail due to duplicate column values.)
You need to match the dimension of the Series with the rows of the DataFrame. There are a few ways to do this but I like to use transposes.
data = np.random.randn(10,3)
df = pd.DataFrame(data)
(df[0] / df.T).T
   0          1         2
0  1  -0.568096 -0.248052
1  1  -0.792876 -3.539075
2  1 -25.452247  1.434969
3  1  -0.685193 -0.540092
4  1   0.451879 -0.217639
5  1  -2.691260 -3.208036
6  1   0.351231 -1.467990
7  1   0.249589 -0.714330
8  1   0.033477 -0.004391
9  1  -0.958395 -1.530424
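Another option (my own sketch, not from the answers above) is DataFrame.rdiv, which computes df[0] / df while broadcasting the Series along the rows:
result = df.rdiv(df[0], axis=0)  # column 0 becomes all 1's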

What is the fastest way to build a DataFrame piece by piece?

I am downloading price data from Bloomberg and want to build a DataFrame in the fastest and least memory-intensive way. Let's say I submit a data request to Bloomberg through Python for the price data of all current S&P 500 stocks from 1-1-2000 to 1-1-2013. Data is returned by ticker and then by date and value, one at a time. My current method is to create a list for the dates and another list for the prices, and to append a date and a price to each list as they are read from the Bloomberg response. Then, when all the dates and prices have been read for a particular ticker, I create a DataFrame for that ticker using
ticker_df = pd.DataFrame(price_list, index=dates_list, columns=[ticker], dtype=float)
I do this for each ticker, appending each ticker DataFrame to a list (df_list.append(ticker_df)) after each ticker's data is read. When all the ticker DataFrames are made, I combine them into one DataFrame:
lg_index = []
for num in range(len(df_list)):
    if len(lg_index) < len(df_list[num].index):
        lg_index = df_list[num].index  # use the largest index for creating result_df
result_df = pd.DataFrame(index=lg_index)
for num in range(len(df_list)):
    result_df[df_list[num].columns[0]] = df_list[num]
The reason I do it this way is that the indexes for each ticker are not identical (if a stock only IPO'd last year, etc.).
I'm guessing there must be a better way to accomplish this using less memory and in a faster way; I just can't think of it. Thanks!
I'm not 100% sure what you're after, but you can concat a list of DataFrames:
pd.concat(df_list)
For example:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]])
In [12]: pd.concat([df, df, df])
Out[12]:
   0  1
0  1  2
1  3  4
0  1  2
1  3  4
0  1  2
1  3  4
In [13]: pd.concat([df, df, df], axis=1)
Out[13]:
   0  1  0  1  0  1
0  1  2  1  2  1  2
1  3  4  3  4  3  4
or do an outer merge/join:
In [14]: df1 = pd.DataFrame([[1, 2]], columns=[0, 2])
In [15]: df.merge(df1, how='outer') # do several of these
Out[15]:
   0  1    2
0  1  2    2
1  3  4  NaN
See the merge, join, concatenate section of the docs.
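For the original ticker use case, a column-wise concat should replace the manual largest-index loop entirely, since concat aligns on the union of all the date indexes (a sketch, assuming df_list as built in the question):
result_df = pd.concat(df_list, axis=1)  # dates missing for a late-IPO ticker become NaN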
