How to show zero counts in pandas groupby for large dataframes - python

I have a dataframe where I want to see the zero counts. It's essentially a duplicate of this question: Pandas Groupby How to Show Zero Counts in DataFrame
Unfortunately, the accepted answer there doesn't work for my case. Whenever I try the MultiIndex.from_product approach, I get the error:
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
which is because I have too many unique values in the groupby columns. I've confirmed that the same script works for much smaller dataframes with fewer unique indices (and therefore fewer elements in df.index.levels[i].values).
Here's an idea of the dataframe I'm working with:
user1 user2 hour
-------------------
Alice Bob 0
Alice Carol 1
Alice Bob 13
Bob Eve 2
which I want to turn into
user1 user2 hour count
-------------------------------
Alice Bob 0 1
Alice Bob 1 0
Alice Bob 2 0
and so on. But what I get is:
user1 user2 hour count
-------------------------------
Alice Bob 0 1
Alice Bob 13 1
Alice Carol 1 1
However, I have ~1.2M unique combinations of user1-user2, so MultiIndex.from_product doesn't work.
EDIT: Here's the code I used for some dummy dataframe. It works for the dummy case, but not for the larger case:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'hour': [0, 1, 0, 0, 1, 1],
                   'to_count': [20, 10, 5, 4, 17, 6]})
print(df)

# Count per (id, hour) combination
agg_df = df.groupby(['id', 'hour']).agg({'to_count': 'count'})
print(agg_df)
print(len(agg_df.index.levels))

# Build the full product of the index levels, forcing the hour level to [0, 1, 2]
levels = [agg_df.index.levels[i].values for i in range(len(agg_df.index.levels))]
levels[-1] = [0, 1, 2]
print(len(levels))
print(agg_df.index.names)
new_index = pd.MultiIndex.from_product(levels, names=agg_df.index.names)

# Reindex agg_df and fill missing combinations with zero (NaN by default)
agg_df = agg_df.reindex(new_index, fill_value=0)
# Reset index
agg_df = agg_df.reset_index()
Is there a better way to show zero counts for groupby in large pandas dataframes?
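One approach worth sketching: MultiIndex.from_product crosses every unique user1 with every unique user2, but only the ~1.2M observed user1-user2 pairs need to be completed over the hour level. Unstacking the hour level, reindexing its columns, and stacking back keeps the index limited to observed pairs. This is a minimal sketch, assuming hours run 0-23 and the column names from the example above:
import pandas as pd

# df is assumed to have columns user1, user2, hour as in the example above
counts = (
    df.groupby(['user1', 'user2', 'hour'])
      .size()
      .unstack('hour', fill_value=0)             # one column per observed hour
      .reindex(columns=range(24), fill_value=0)  # add the missing hours as 0
      .stack()                                   # back to long format
      .rename('count')
      .reset_index()
)
This builds a table of (observed pairs) x 24 cells rather than the full product over all unique user1 and user2 values, so it stays far smaller in memory.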

Related

Join two Pandas dataframes, sampling from the smaller dataframe

I have two dataframes that look as follows:
import pandas as pd
import io
train_data="""input_example,user_id
example0.npy, jane
example1.npy, bob
example4.npy, alice
example5.npy, jane
example3.npy, bob
example2.npy, bob
"""
user_data="""user_data,user_id
data_jane0.npy, jane
data_jane1.npy, jane
data_bob0.npy, bob
data_bob1.npy, bob
data_alice0.npy, alice
data_alice1.npy, alice
data_alice2.npy, alice
"""
train_df = pd.read_csv(io.StringIO(train_data), sep=",")
user_df = pd.read_csv(io.StringIO(user_data), sep=",")
Suppose that the train_df table is many thousands of entries long, i.e., there are 1000s of unique "exampleN.npy" files. I was wondering if there was a straightforward way to merge the train_df and user_df tables where each row of the joined table matches on the key user_id but is subsampled from user_df.
Here is one example of a resulting dataframe (I'm trying to do uniform sampling, so theoretically, there are infinite possible result dataframes):
>>> result_df
input_example user_data user_id
0 example0.npy data_jane0.npy jane
1 example1.npy data_bob1.npy bob
2 example4.npy data_alice0.npy alice
3 example5.npy data_jane1.npy jane
4 example3.npy data_bob0.npy bob
5 example2.npy data_bob0.npy bob
That is, the user_data column is filled with a random choice of filename based on the corresponding user_id.
I know one could write this using some multi-line, for-loop, query-based approach, but perhaps there is a faster way using built-in Pandas functions, e.g., "sample", "merge", "join", or "combine".
You can sample by groups in user_df and then join that with train_df.
e.g.,
# this samples by fraction, so each row is equally likely
user_df = user_df.groupby("user_id").sample(frac=0.5, replace=True)
user_data user_id
6 data_alice2.npy alice
4 data_alice0.npy alice
3 data_bob1.npy bob
0 data_jane0.npy jane
or
# this samples 2 rows per group
user_df = user_df.groupby("user_id").sample(n=2, replace=True)
user_data user_id
6 data_alice2.npy alice
4 data_alice0.npy alice
2 data_bob0.npy bob
2 data_bob0.npy bob
0 data_jane0.npy jane
1 data_jane1.npy jane
Join
pd.merge(train_df, user_df)
I don't know if it is possible to sample during the merge without first merging the two in full, but the following avoids a multi-line for loop:
merged = train_df.merge(user_df, on="user_id", how="left") \
                 .groupby("input_example", as_index=False) \
                 .apply(lambda x: x.sample(1)) \
                 .reset_index(drop=True)
This does the following:
merge the two together on "user_id", taking only the user_ids that appear in the left table
group by "input_example", assuming these are all unique (otherwise one could group on both columns of train_df)
take a sample of size 1 for each group
reset the index
Sampling after the merge means that rows with the same user_id will not necessarily get the same user_data (whereas sampling user_df first would give every output row with a given user_id the same user_data).
I think I figured out a solution myself; it's a one-liner, but conceptually it's the same as what @Rawson suggested. First, I do a left merge, which results in a table with many duplicates. Then I shuffle all the rows to give it randomness. Finally, I drop the duplicates. If I add "sort_index", the resulting table has the same ordering as the original table.
I'm able to use the random_state kwarg to switch up which user_data file is used. See here:
>>> train_df.merge(user_df, on='user_id', how='left').sample(frac=1, random_state=0).drop_duplicates('input_example').sort_index()
input_example user_id user_data
1 example0.npy jane data_jane1.npy
2 example1.npy bob data_bob0.npy
6 example4.npy alice data_alice2.npy
8 example5.npy jane data_jane1.npy
10 example3.npy bob data_bob1.npy
11 example2.npy bob data_bob0.npy
>>> train_df.merge(user_df, on='user_id', how='left').sample(frac=1, random_state=1).drop_duplicates('input_example').sort_index()
input_example user_id user_data
1 example0.npy jane data_jane1.npy
2 example1.npy bob data_bob0.npy
4 example4.npy alice data_alice0.npy
7 example5.npy jane data_jane0.npy
10 example3.npy bob data_bob1.npy
12 example2.npy bob data_bob1.npy

Pandas Dataframe - Turn Columns To Rows Sorted by "Grand Totals" Row

I have data in a dataframe that looks like this, where each column is a KEYWORD and each row is an observation of how many times each ID said the word:
id      bagel   pizza
ABC     2       3
DEF     1       3
GHI     7       9
TOTAL   10      15
I am trying to get it into a form where I can see which word is most popular overall, i.e., the keyword column names become rows and the TOTAL row becomes a sortable column:
Column  Total
bagel   10
pizza   15
I have tried melt and stack but I don't think I am using either one correctly. Any help is appreciated.
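Since melt is mentioned, here is a minimal melt-based sketch (assuming the dataframe is named df and the totals live in the TOTAL row, as above):
import pandas as pd

df = pd.DataFrame({'id': ['ABC', 'DEF', 'GHI', 'TOTAL'],
                   'bagel': [2, 1, 7, 10],
                   'pizza': [3, 3, 9, 15]})

# Keep only the TOTAL row, melt the keyword columns into rows,
# and sort by the totals so the most popular word comes first.
out = (df[df['id'] == 'TOTAL']
         .melt(id_vars='id', var_name='Column', value_name='Total')
         .drop(columns='id')
         .sort_values('Total', ascending=False))
print(out)
#   Column  Total
# 1  pizza     15
# 0  bagel     10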
Select the TOTAL row, set the index, then transpose with T:
out = df[df.id.eq('TOTAL')].set_index('id').T.reset_index()
Out[433]:
id index TOTAL
0 bagel 10
1 pizza 15
You can use df.sum()
data = df.sum(numeric_only=True, axis=0)
The code above returns a Series; you need to convert it into a DataFrame using the syntax below and set the column names.
df = pd.DataFrame({'Column':data.index, 'Total':data.values})
print(df)
That gives me,
Column Total
0 bagel 10
1 pizza 15
You can also do the following to set the Column column as the index, removing the default (0, 1, etc.) integer index.
df = df.set_index('Column')
print(df)
Which gives me,
Total
Column
bagel 10
pizza 15

Dropping rows where a dynamic number of integer columns only contain 0's

I have the following problem. Here is an example dataframe:
Name    Planet    Number Column #1  Number Column #2
John    Earth     2                 0
Peter   Terra     0                 0
Anna    Mars      5                 4
Robert  Knowhere  0                 1
Here, I want to remove only the rows in which all the number columns are 0. In this case that is the second row (Peter), so my dataframe has to become:
Name    Planet    Number Column #1  Number Column #2
John    Earth     2                 0
Anna    Mars      5                 4
Robert  Knowhere  0                 1
For this, I have a solution and it is the following:
new_df = old_df.loc[(old_df['Number Column #1'] > 0) | (old_df['Number Column #2'] > 0)]
This works, however I have another problem. My dataframe, based on the request, will dynamically have a different number of number columns. For example:
Name    Planet    Number Column #1  Number Column #2  Number Column #3
John    Earth     2                 0                 1
Peter   Terra     0                 0                 0
Anna    Mars      5                 4                 2
Robert  Knowhere  0                 1                 1
This is the problematic part, as I am not sure how to adjust my code to work with a dynamic set of columns. I've tried multiple things from StackOverflow and the Pandas documentation, but most examples only work for dataframes in which all columns are integers; there the comparison yields booleans and a simple solution like this works:
new_df = (df != 0).any(axis=1)
In my case, however, the text columns (which are always the same) are the problematic ones. Does anyone have an idea for a solution? Thanks a lot in advance!
P.S. I have the names of the number columns available beforehand in the code as a list, for example:
my_num_columns = ["Number Column #1", "Number Column #2", "Number Column #3"]
# my pandas logic...
IIUC:
You can use select_dtypes() to select the int and float columns, then check your condition on those and filter the dataframe:
df=df.loc[~df.select_dtypes(['int','float']).eq(0).all(axis=1)]
#OR
df=df.loc[df.select_dtypes(['int','float']).ne(0).any(axis=1)]
Note: if needed, you can also include 'bool' columns, typecast them to float, and then check your condition:
df=df.loc[df.select_dtypes(['int','float','bool']).astype(float).ne(0).any(axis=1)]
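Since the question says the number column names are already available as a list, a minimal sketch that uses that list directly (data and names taken from the example above) could be:
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Peter', 'Anna', 'Robert'],
    'Planet': ['Earth', 'Terra', 'Mars', 'Knowhere'],
    'Number Column #1': [2, 0, 5, 0],
    'Number Column #2': [0, 0, 4, 1],
    'Number Column #3': [1, 0, 2, 1],
})
my_num_columns = ['Number Column #1', 'Number Column #2', 'Number Column #3']

# Keep rows where at least one of the known number columns is non-zero
new_df = df.loc[df[my_num_columns].ne(0).any(axis=1)]
print(new_df)  # Peter's all-zero row is dropped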

Comparing rows of string inside groupby and assigning a value to a new column pandas

I have a dataset of employees (their IDs) and the names of their bosses for several years.
df:
What I need to do is to see whether an employee had a change of boss. So, the desired output is:
For employees who appear in the df only once, I just assign 0 (no boss change). However, I cannot figure out how to do it for the employees who are in the df for several years.
I was thinking that first I need to assign 0 for the first year they appear in the df (because we do not know who the boss was before, so there is no boss change). Then I need to compare the name of the boss with the name in the next row and decide whether to assign 1 or 0 in the ManagerChange column.
So far I split the df into two (with unique IDs and duplicated IDs) and assigned 0 to ManagerChange for the unique IDs.
Then I group the duplicated IDs and sort them by year. However, I am new to Python and cannot figure out how to compare strings and assign the result to a new column inside the groupby. Please help.
Code I have so far:
# splitting database in two
bool_series = df["ID"].duplicated(keep=False)
df_duplicated=df[bool_series]
df_unique = df[~bool_series]
# assigning 0 for ManagerChange for the unique IDs
df_unique['ManagerChange'] = 0
# groupby by ID and sorting by year for the duplicated IDs
df_duplicated.groupby('ID').apply(lambda x: x.sort_values('Year'))
You can groupby, then shift() the group and compare on the Boss column.
# Sort value first
df.sort_values(['ID', 'Year'], inplace=True)
# Compare Boss column with shifted Boss column
df['ManagerChange'] = df.groupby('ID').apply(lambda group: group['Boss'] != group['Boss'].shift(1)).tolist()
# Change True to 1, False to 0
df['ManagerChange'] = df['ManagerChange'].map({True: 1, False: 0})
# Sort df to original df
df = df.sort_index()
# Change the first in each group to 0
df.loc[df.groupby('ID').head(1).index, 'ManagerChange'] = 0
# print(df)
ID Year Boss ManagerChange
0 1234 2018 Anna 0
1 567 2019 Sarah 0
2 1234 2020 Michael 0
3 8976 2019 John 0
4 1234 2019 Michael 1
5 8976 2020 John 0
You could also make use of the fill_value argument; this lets you get rid of the last df.loc[] operation.
# Sort value first
df.sort_values(['ID', 'Year'], inplace=True)
df['ManagerChange'] = df.groupby('ID').apply(lambda group: group['Boss'] != group['Boss'].shift(1, fill_value=group['Boss'].iloc[0])).tolist()
# Change True to 1, False to 0
df['ManagerChange'] = df['ManagerChange'].map({True: 1, False: 0})
# Sort df to original df
df = df.sort_index()
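For reference, the same idea can also be written without apply, as a vectorized sketch (data and column names taken from the output above):
import pandas as pd

df = pd.DataFrame({'ID': [1234, 567, 1234, 8976, 1234, 8976],
                   'Year': [2018, 2019, 2020, 2019, 2019, 2020],
                   'Boss': ['Anna', 'Sarah', 'Michael', 'John', 'Michael', 'John']})

df = df.sort_values(['ID', 'Year'])
# Compare each Boss with the previous year's Boss within the same ID
df['ManagerChange'] = (df['Boss'] != df.groupby('ID')['Boss'].shift()).astype(int)
# The first year of each ID compares against NaN, so force it to 0
df.loc[df.groupby('ID').head(1).index, 'ManagerChange'] = 0
# Restore the original row order
df = df.sort_index()
print(df)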

How to sum a list of pandas dataframes with respect to a given column

I have a list of pandas dataframes with two columns, basically a class and a value:
df1:
Name  Count
Bob   10
John  20
df2:
Name  Count
Mike  30
Bob   40
The same "Name" may appear in several dataframes or in only one, and the list contains over 100 dataframes; within each dataframe, all "Names" are unique.
What I need is to iterate over all dataframes and create one big one that lists every "Name" with the total sum of its "Count" across all the dataframes, like this:
result:
Name  Count
Bob   50
John  20
Mike  30
Bob's counts are summed; the others are not, since they appear only once. Is there an efficient way to do this when there are many dataframes?
Do pd.concat, then groupby:
df = pd.concat(dfs) # where dfs is a list of dataframes
then you can do
gp = df.groupby(['Name'])['Count'].sum()
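Put together with the example frames from the question, a minimal runnable sketch (as_index=False keeps Name as a column):
import pandas as pd

df1 = pd.DataFrame({'Name': ['Bob', 'John'], 'Count': [10, 20]})
df2 = pd.DataFrame({'Name': ['Mike', 'Bob'], 'Count': [30, 40]})
dfs = [df1, df2]  # in practice, the list of 100+ dataframes

# Concatenate everything once, then sum Count per Name
result = pd.concat(dfs).groupby('Name', as_index=False)['Count'].sum()
print(result)
#    Name  Count
# 0   Bob     50
# 1  John     20
# 2  Mike     30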
You can do the following (use fill_value=0 so that names contained in only one dataframe still get a value):
df1.set_index('Name').add(df2.set_index('Name'), fill_value=0).reset_index()
>>> Name Count
0 Bob 50.0
1 John 20.0
2 Mike 30.0
