Converting a Pandas GroupBy output from Series to DataFrame - python

I'm starting with input data like this
df1 = pandas.DataFrame( {
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )
Which when printed appears as this:
City Name
0 Seattle Alice
1 Seattle Bob
2 Portland Mallory
3 Seattle Mallory
4 Seattle Bob
5 Portland Mallory
Grouping is simple enough:
g1 = df1.groupby( [ "Name", "City"] ).count()
and printing yields a GroupBy object:
City Name
Name City
Alice Seattle 1 1
Bob Seattle 2 2
Mallory Portland 2 2
Seattle 1 1
But what I want eventually is another DataFrame object that contains all the rows in the GroupBy object. In other words I want to get the following result:
City Name
Name City
Alice Seattle 1 1
Bob Seattle 2 2
Mallory Portland 2 2
Mallory Seattle 1 1
I can't quite see how to accomplish this in the pandas documentation. Any hints would be welcome.

g1 here is a DataFrame. It has a hierarchical index, though:
In [19]: type(g1)
Out[19]: pandas.core.frame.DataFrame
In [20]: g1.index
Out[20]:
MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
('Mallory', 'Seattle')], dtype=object)
Perhaps you want something like this?
In [21]: g1.add_suffix('_Count').reset_index()
Out[21]:
Name City City_Count Name_Count
0 Alice Seattle 1 1
1 Bob Seattle 2 2
2 Mallory Portland 2 2
3 Mallory Seattle 1 1
Or something like:
In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
Out[36]:
Name City count
0 Alice Seattle 1
1 Bob Seattle 2
2 Mallory Portland 2
3 Mallory Seattle 1

I want to slightly change the answer given by Wes, because version 0.16.2 requires as_index=False. If you don't set it, you get an empty dataframe.
Source:
Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.
Passing as_index=False will return the groups that you are aggregating over, if they are named columns.
Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max. This is what happens when you do for example DataFrame.sum() and get back a Series.
nth can act as a reducer or a filter, see here.
import pandas as pd
df1 = pd.DataFrame({"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
"City":["Seattle","Seattle","Portland","Seattle","Seattle","Portland"]})
print(df1)
#
# City Name
#0 Seattle Alice
#1 Seattle Bob
#2 Portland Mallory
#3 Seattle Mallory
#4 Seattle Bob
#5 Portland Mallory
#
g1 = df1.groupby(["Name", "City"], as_index=False).count()
print(g1)
#
# City Name
#Name City
#Alice Seattle 1 1
#Bob Seattle 2 2
#Mallory Portland 2 2
# Seattle 1 1
#
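As a rough illustration of what as_index=False does on a recent pandas (this sketch assumes pandas 1.1 or later, where size() on such a groupby returns a DataFrame with a "size" column; the variable name flat is only for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
})

# With as_index=False the group keys stay ordinary columns
# instead of becoming a MultiIndex on the result.
flat = df1.groupby(["Name", "City"], as_index=False).size()
print(flat)
```

Older versions may behave differently, as the version notes in this answer suggest, so check the output on your own installation.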
EDIT:
In version 0.17.1 and later, you can select a subset of columns before calling count, or call size and pass the name parameter to reset_index:
print(df1.groupby(["Name", "City"], as_index=False).count())
#IndexError: list index out of range
print(df1.groupby(["Name", "City"]).count())
#Empty DataFrame
#Columns: []
#Index: [(Alice, Seattle), (Bob, Seattle), (Mallory, Portland), (Mallory, Seattle)]
print(df1.groupby(["Name", "City"])[['Name','City']].count())
# Name City
#Name City
#Alice Seattle 1 1
#Bob Seattle 2 2
#Mallory Portland 2 2
# Seattle 1 1
print(df1.groupby(["Name", "City"]).size().reset_index(name='count'))
# Name City count
#0 Alice Seattle 1
#1 Bob Seattle 2
#2 Mallory Portland 2
#3 Mallory Seattle 1
The difference between count and size is that size counts NaN values while count does not.
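A small sketch of that difference, using a made-up Score column (not from the question) that contains one NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical data: Alice has one missing score.
df = pd.DataFrame({
    "Name": ["Alice", "Alice", "Bob"],
    "Score": [1.0, np.nan, 3.0],
})

grouped = df.groupby("Name")["Score"]
sizes = grouped.size()    # rows per group, NaN rows included
counts = grouped.count()  # non-NaN values per group

print(sizes)
print(counts)
```

Here Alice has a size of 2 but a count of 1, because one of her two rows holds NaN.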

The key is to use the reset_index() method.
Use:
import pandas
df1 = pandas.DataFrame( {
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )
g1 = df1.groupby( [ "Name", "City"] ).count().reset_index()
Now you have your new DataFrame in g1.

Simply, this should do the task:
import pandas as pd
grouped_df = df1.groupby( [ "Name", "City"] )
pd.DataFrame(grouped_df.size().reset_index(name = "Group_Count"))
Here, grouped_df.size() returns the number of rows in each group as a Series, and reset_index(name="Group_Count") moves the group keys back into columns and names the count column.
The result of reset_index() is already a DataFrame, so the outer pd.DataFrame() call is optional.
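To check what each step returns, here is a minimal sketch on the question's data (the variable names sizes and result are only for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
})

# size() gives a Series indexed by (Name, City)
sizes = df1.groupby(["Name", "City"]).size()

# reset_index(name=...) turns that Series into a flat DataFrame
result = sizes.reset_index(name="Group_Count")

print(type(sizes).__name__)
print(type(result).__name__)
print(result)
```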

Maybe I misunderstand the question but if you want to convert the groupby back to a dataframe you can use .to_frame(). I wanted to reset the index when I did this so I included that part as well.
Example code (unrelated to the question's data):
df = df['TIME'].groupby(df['Name']).min()
df = df.to_frame()
df = df.reset_index()
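For a version that can be run end to end, here is a sketch with made-up Name and TIME values (the data and the variable names earliest, frame and flat are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical data, just to make the steps runnable.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "TIME": [5, 3, 2, 7],
})

earliest = df["TIME"].groupby(df["Name"]).min()  # Series indexed by Name
frame = earliest.to_frame()                      # one-column DataFrame, Name is still the index
flat = frame.reset_index()                       # Name becomes a regular column

print(flat)
```

Only Name is an index level at that point, so reset_index() with no arguments is enough to flatten the result.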

I found this worked for me.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
df1['City_count'] = 1
df1['Name_count'] = 1
df1.groupby(['Name', 'City'], as_index=False).count()

The solution below may be simpler:
df1.reset_index().groupby( [ "Name", "City"],as_index=False ).count()

I aggregated the Qty data per group and stored the result in a DataFrame:
almo_grp_data = pd.DataFrame({'Qty_cnt' :
almo_slt_models_data.groupby( ['orderDate','Item','State Abv']
)['Qty'].sum()}).reset_index()

This returns the ordinal levels/indices in the same order as a vanilla groupby() method. It's basically the same as the answer @NehalJWani posted in his comment, but stored in a variable with the reset_index() method called on it.
fare_class = df.groupby(['Satisfaction Rating','Fare Class']).size().to_frame(name = 'Count')
fare_class.reset_index()
This version not only returns the same data with percentages which is useful for stats, but also includes a lambda function.
fare_class_percent = df.groupby(['Satisfaction Rating', 'Fare Class']).size().to_frame(name = 'Percentage')
fare_class_percent.transform(lambda x: 100 * x/x.sum()).reset_index()
Example output:
Satisfaction Rating Fare Class Percentage
0 Dissatisfied Business 14.624269
1 Dissatisfied Economy 36.469048
2 Satisfied Business 5.460425
3 Satisfied Economy 33.235294

These solutions only partially worked for me because I was doing multiple aggregations on my grouped data, which I wanted to convert to a DataFrame.
Because I wanted more than the count provided by reset_index(), I wrote a manual method for converting the grouped output into a DataFrame. I understand this is not the most pythonic/pandas way of doing this as it is quite verbose and explicit, but it was all I needed. Basically, use the reset_index() method explained above to start a "scaffolding" DataFrame, then loop through the group pairings in the grouped DataFrame, retrieve the indices, perform your calculations against the ungrouped DataFrame, and set the value in your new aggregated DataFrame.
df_grouped = df[['Salary Basis', 'Job Title', 'Hourly Rate', 'Male Count', 'Female Count']]
df_grouped = df_grouped.groupby(['Salary Basis', 'Job Title'], as_index=False)
# Grouped gives us the indices we want for each grouping
# We cannot convert a groupedby object back to a dataframe, so we need to do it manually
# Create a new dataframe to work against
df_aggregated = df_grouped.size().to_frame('Total Count').reset_index()
df_aggregated['Male Count'] = 0
df_aggregated['Female Count'] = 0
df_aggregated['Job Rate'] = 0
def manualAggregations(indices_array):
    temp_df = df.iloc[indices_array]
    return {
        'Male Count': temp_df['Male Count'].sum(),
        'Female Count': temp_df['Female Count'].sum(),
        'Job Rate': temp_df['Hourly Rate'].max()
    }
for name, group in df_grouped:
    ix = df_grouped.indices[name]
    calcDict = manualAggregations(ix)
    for key in calcDict:
        # Salary Basis, Job Title
        columns = list(name)
        df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) &
                          (df_aggregated['Job Title'] == columns[1]), key] = calcDict[key]
If a dictionary isn't your thing, the calculations could be applied inline in the for loop:
df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) &
                  (df_aggregated['Job Title'] == columns[1]), 'Male Count'] = df['Male Count'].iloc[ix].sum()
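As an alternative to the manual loop (a different technique, not the one used in this answer), pandas named aggregation can compute several aggregations in one call. The sketch below assumes pandas 0.25 or later and uses made-up rows with the same column names:

```python
import pandas as pd

# Hypothetical data using the column names from this answer.
df = pd.DataFrame({
    "Salary Basis": ["Hourly", "Hourly", "Salaried"],
    "Job Title": ["Clerk", "Clerk", "Manager"],
    "Hourly Rate": [15.0, 18.0, 40.0],
    "Male Count": [2, 1, 3],
    "Female Count": [1, 4, 2],
})

# Each output column is defined as (input column, aggregation).
# The dict is unpacked because the output names contain spaces.
df_aggregated = (
    df.groupby(["Salary Basis", "Job Title"], as_index=False)
      .agg(**{
          "Total Count": ("Hourly Rate", "size"),
          "Male Count": ("Male Count", "sum"),
          "Female Count": ("Female Count", "sum"),
          "Job Rate": ("Hourly Rate", "max"),
      })
)

print(df_aggregated)
```

The result is already a flat DataFrame with one row per (Salary Basis, Job Title) pair, so no scaffolding frame or loop is needed.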

import numpy as np
import pandas as pd
# Count the 'W' entries per (Team, Year); reset_index() already returns a DataFrame
team_wins_df = df.groupby(['Team', 'Year'])['W'].count().reset_index()
team_wins_df = team_wins_df.rename({'W': 'Wins'}, axis=1)
team_wins_df['Wins'] = team_wins_df['Wins'].astype(np.int32)
print(team_wins_df)

Try setting group_keys=False in the groupby method to prevent the group keys from being added to the index. Note that group_keys only affects the result of apply.
Example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
df1.groupby(["Name"], group_keys=False)

Related

pandas to_json exclude the groupby keys

How do we exclude the grouped by key from the to_json method ?
import pandas as pd
students_df = pd.DataFrame(
[
["Jay", 16, "Soccer"],
["Jack", 19, "FootBall"],
["Dorsey", 19, "Dining"],
["Mark", 18, "Swimming"],
],
columns=["Name", "Age", "Sport"],
)
students_df.groupby("Name").apply(lambda x: x.to_json(orient="records")).reset_index(
name="students_json"
)
Current output:
Name students_json
0 Dorsey [{"Name":"Dorsey","Age":19,"Sport":"Dining"}]
1 Jack [{"Name":"Jack","Age":19,"Sport":"FootBall"}]
2 Jay [{"Name":"Jay","Age":16,"Sport":"Soccer"}]
3 Mark [{"Name":"Mark","Age":18,"Sport":"Swimming"}]
I want to exclude the grouped by key from the resulting json.
There could be multiple keys on which I can group on not just name.
Expected output should be:
Name students_json
0 Dorsey [{"Age":19,"Sport":"Dining"}]
1 Jack [{"Age":19,"Sport":"FootBall"}]
2 Jay [{"Age":16,"Sport":"Soccer"}]
3 Mark [{"Age":18,"Sport":"Swimming"}]
You could drop it:
out = students_df.groupby('Name').apply(lambda x: x.drop(columns='Name').to_json(orient="records"))
Output:
Name
Dorsey [{"Age":19,"Sport":"Dining"}]
Jack [{"Age":19,"Sport":"FootBall"}]
Jay [{"Age":16,"Sport":"Soccer"}]
Mark [{"Age":18,"Sport":"Swimming"}]
dtype: object
Specify which columns you want in the json.
students_df.groupby("Name").apply(
lambda x: x[["Age", "Sport"]].to_json(orient="records")).reset_index(name="students_json")
Name students_json
0 Dorsey [{"Age":19,"Sport":"Dining"}]
1 Jack [{"Age":19,"Sport":"FootBall"}]
2 Jay [{"Age":16,"Sport":"Soccer"}]
3 Mark [{"Age":18,"Sport":"Swimming"}]
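Since the question mentions grouping on more than one key, here is a sketch that works from a list of key columns. It iterates over the groups instead of using apply, which is a swapped-in approach rather than the one in the answers above; the names keys, rows and out are only for illustration.

```python
import pandas as pd

students_df = pd.DataFrame(
    [
        ["Jay", 16, "Soccer"],
        ["Jack", 19, "FootBall"],
        ["Dorsey", 19, "Dining"],
        ["Mark", 18, "Swimming"],
    ],
    columns=["Name", "Age", "Sport"],
)

keys = ["Name"]  # could hold several column names

rows = []
for key_values, group in students_df.groupby(keys):
    # Depending on the pandas version, a single key may arrive as a scalar.
    if not isinstance(key_values, tuple):
        key_values = (key_values,)
    record = dict(zip(keys, key_values))
    # Drop the key columns before serialising the group.
    record["students_json"] = group.drop(columns=keys).to_json(orient="records")
    rows.append(record)

out = pd.DataFrame(rows)
print(out)
```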

How to implode (reverse of explode) only non-null values in pandas. Merge multiple rows into single row using pandas group by

I am working on Python Pandas.
I have a pandas dataframe with columns like this:
ID Cities
1 New York
1 ''
1 Atlanta
2 Tokyo
2 Kyoto
2 ''
3 Paris
3 Bordeaux
3 ''
4 Mumbai
4 ''
4 Bangalore
5 London
5 ''
5 Bermingham
Note that the empty cells in the column are either an empty string (''), NaN, or None. (For simplicity, let's just say they are empty strings ('').)
And I want the result to be like this:
ID Cities
1 New York, Atlanta
2 Tokyo, Kyoto
3 Paris, Bordeaux
4 Mumbai, Bangalore
5 London, Bermingham
In short, I want to group by ID and then get the list of cities, with the empty strings removed.
I have sample code for this, but the result it gives me still contains the empty strings, which I want to remove.
(dataFrame.groupby(['ID'], as_index=False)
    .agg({'Cities': lambda x: x.tolist()}))
It gives me result like this:
ID Cities
1 New York, ,Atlanta
2 Tokyo, Kyoto,
3 Paris, Bordeaux,
4 Mumbai, , Bangalore
5 London, , Bermingham
But I don't want the empty strings.
Please help me here.
Thank you so much for your help.
You can try replacing empty string by NaN and then add .dropna() to the aggregate lambda function, as follows:
df['Cities'] = df['Cities'].replace('', np.nan)
(df.groupby('ID', as_index=False)
.agg({'Cities': lambda x: x.dropna().tolist()})
)
Result:
ID Cities
0 1 [New York, Atlanta]
1 2 [Tokyo, Kyoto]
2 3 [Paris, Bordeaux]
3 4 [Mumbai, Bangalore]
4 5 [London, Bermingham]
We can also perform the operations at the Series level: mask out the unneeded values such as the empty string (''), use dropna to remove the missing values, then groupby and aggregate into whatever type is needed, such as a list:
new_df = (
df['Cities']
.mask(df['Cities'].eq("")) # Replace Empty String with NaN
.dropna() # Exclude NaN
.groupby(df['ID']) # Groupby ID
.aggregate(list) # Join Into List
.reset_index() # Convert Back to DataFrame
)
Or filter out unneeded rows by condition:
new_df = (
# Filter out by condition
df.loc[df['Cities'].ne("") & df['Cities'].notnull(), 'Cities']
.groupby(df['ID']) # Groupby ID
.aggregate(list) # Join Into List
.reset_index() # Convert Back to DataFrame
)
new_df:
ID Cities
0 1 [New York, Atlanta]
1 2 [Tokyo, Kyoto]
2 3 [Paris, Bordeaux]
3 4 [Mumbai, Bangalore]
4 5 [London, Bermingham]
Setup:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
'Cities': ['New York', "", 'Atlanta', 'Tokyo', 'Kyoto', "", 'Paris',
'Bordeaux', "", 'Mumbai', "", 'Bangalore', 'London', "",
'Bermingham']
})
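The question's desired output shows comma-separated strings rather than lists. If that is what is needed, one possible variation is to aggregate with ", ".join. This is a sketch on a shortened copy of the data; the names mask and result are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Cities": ["New York", "", "Atlanta", "Tokyo", "Kyoto", None],
})

# Keep only rows whose city is neither missing nor an empty string.
mask = df["Cities"].notna() & df["Cities"].ne("")

result = (
    df.loc[mask]
      .groupby("ID")["Cities"]
      .agg(", ".join)   # join each group's cities into one string
      .reset_index()    # convert back to a DataFrame
)

print(result)
```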

Using df1 as a lookup table for df2, df2 has more unique values than df1 in Python

I have a df with US citizens state and I would like to use that as a lookup for world citizens
df1=
[Sam, New York;
Nick, California;
Sarah, Texas]
df2 =
[Sam;
Phillip;
Will;
Sam]
I would like to either df2.replace() with the states or create df3 where my output is:
[New York;
NaN;
NaN;
New York]
I have tried mapping with set_index and dict(zip()) but have had no luck so far.
Thank you.
How about this method:
import pandas as pd
df1 = pd.DataFrame([['Sam','New York'],['Nick','California'],['Sarah','Texas']],\
columns = ['name','state'])
display(df1)
df2 = pd.DataFrame(['Sam','Phillip','Will','Sam'],\
columns = ['name'])
display(df2)
df2.merge(right=df1,left_on='name',right_on='name',how='left')
resulting in
name state
0 Sam New York
1 Nick California
2 Sarah Texas
name
0 Sam
1 Phillip
2 Will
3 Sam
name state
0 Sam New York
1 Phillip NaN
2 Will NaN
3 Sam New York
You can then select just the state column from the merged DataFrame.
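Since the question mentions trying a mapping with set_index, here is a sketch of that route as well. It assumes the names in df1 are unique; the variable name lookup is only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["Sam", "New York"], ["Nick", "California"], ["Sarah", "Texas"]],
    columns=["name", "state"],
)
df2 = pd.DataFrame(["Sam", "Phillip", "Will", "Sam"], columns=["name"])

# Build a name -> state lookup Series, then map each name in df2 through it.
lookup = df1.set_index("name")["state"]
df2["state"] = df2["name"].map(lookup)

print(df2)
```

Names that are missing from the lookup come back as NaN, which matches the output the question asks for.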

Splitting DataFrame into DataFrames

I have one DataFrame where different rows can have the same value for one column.
As an example:
import pandas as pd
df = pd.DataFrame( {
"Name" : ["Alice", "Bob", "John", "Mark", "Emma" , "Mary"] ,
"City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )
City Name
0 Seattle Alice
1 Seattle Bob
2 Portland John
3 Seattle Mark
4 Seattle Emma
5 Portland Mary
Here, a given value for "City" (e.g. "Portland") is shared by several rows.
I want to create from this data frame several data frames that have in common the value of one column. For the example above, I want to get the following data frames:
City Name
0 Seattle Alice
1 Seattle Bob
3 Seattle Mark
4 Seattle Emma
and
City Name
2 Portland John
5 Portland Mary
From this answer, I am creating a mask that can be used to generate one data frame:
import numpy as np
def mask_with_in1d(df, column, val):
    mask = np.in1d(df[column].values, [val])
    return df[mask]
# Return the last data frame above
mask_with_in1d(df, 'City', 'Portland')
The problem is to create efficiently all data frames, to which a name will be assigned. I am doing it this way:
unique_values = np.sort(df['City'].unique())
for city_value in unique_values:
    exec("df_{0} = mask_with_in1d(df, 'City', '{0}')".format(city_value))
which gives me the data frames df_Seattle and df_Portland that I can further manipulate.
Is there a better way of doing this?
Have you got a fixed list of cities you want to do this for? The simplest solution is to group by city and then loop over the groups:
for city, names in df.groupby("City"):
    print(city)
    print(names)
Portland
City Name
2 Portland John
5 Portland Mary
Seattle
City Name
0 Seattle Alice
1 Seattle Bob
3 Seattle Mark
4 Seattle Emma
You could then assign each group to a dictionary or some such (df_city[city] = names) if you wanted df_city["Portland"] to work. It depends on what you want to do with the groups once they are split.
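A minimal sketch of the dictionary idea, building it in one pass with a dict comprehension (df_city is the name suggested above; the rest follows the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "John", "Mark", "Emma", "Mary"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
})

# One DataFrame per city, keyed by the city name.
df_city = {city: group for city, group in df.groupby("City")}

print(df_city["Portland"])
```

This keeps the per-city frames addressable by name without resorting to exec.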
You can use groupby for this:
dfs = [gb[1] for gb in df.groupby('City')]
This will construct a list of dataframes, one per value of the 'City' column.
In case you want (value, DataFrame) tuples, pairing each 'City' value with its dataframe, you can use:
dfs = list(df.groupby('City'))
Note that assigning by name is usually an anti-pattern. And exec and eval are definitely antipatterns.

Compare pandas DataFrames and return rows that are missing from the first one

I have 2 dataFrames and want to compare them and return rows from the first one (df1) that are not in the second one (df2). I found a way to compare them and return the differences, but can't figure out how to return only missing ones from df1.
import pandas as pd
from pandas import Series, DataFrame
df1 = pd.DataFrame( {
"City" : ["Chicago", "San Franciso", "Boston"] ,
"State" : ["Illinois", "California", "Massachusett"] } )
df2 = pd.DataFrame( {
"City" : ["Chicago", "Mmmmiami", "Dallas" , "Omaha"] ,
"State" : ["Illinois", "Florida", "Texas", "Nebraska"] } )
df = pd.concat([df1, df2])
df = df.reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
blah = df.reindex(idx)
Building on @EdChum's suggestion:
df = pd.merge(df1, df2, how='outer', suffixes=('','_y'), indicator=True)
rows_in_df1_not_in_df2 = df[df['_merge']=='left_only'][df1.columns]
rows_in_df1_not_in_df2
|Index |City |State |
|------|------------|------------|
|1 |San Franciso|California |
|2 |Boston |Massachusett|
EDIT: incorporated @RobertPeters' suggestion
IIUC, if you're using pandas version 0.17.0 or later, you can use merge and set indicator=True:
In [80]:
df1 = pd.DataFrame( {
"City" : ["Chicago", "San Franciso", "Boston"] ,
"State" : ["Illinois", "California", "Massachusett"] } )
df2 = pd.DataFrame( {
"City" : ["Chicago", "Mmmmiami", "Dallas" , "Omaha"] ,
"State" : ["Illinois", "Florida", "Texas", "Nebraska"] } )
pd.merge(df1,df2, how='outer', indicator=True)
Out[80]:
City State _merge
0 Chicago Illinois both
1 San Franciso California left_only
2 Boston Massachusett left_only
3 Mmmmiami Florida right_only
4 Dallas Texas right_only
5 Omaha Nebraska right_only
This adds a _merge column indicating whether each row is present only in the lhs, only in the rhs, or in both.
You can also use a list comprehension to compare the City values and return the ones missing from df2:
dif_list = [x for x in list(df1['City'].unique()) if x not in list(df2['City'].unique())]
returns:
['San Franciso', 'Boston']
You could then get a dataframe with just the rows that are different:
dfdif = df1[(df1['City'].isin(dif_list))]
If you're on pandas < 0.17.0, you could work your way up like this:
In [182]: df = pd.merge(df1, df2, on='City', how='outer')
In [183]: df
Out[183]:
City State_x State_y
0 Chicago Illinois Illinois
1 San Franciso California NaN
2 Boston Massachusett NaN
3 Mmmmiami NaN Florida
4 Dallas NaN Texas
5 Omaha NaN Nebraska
In [184]: df.loc[df['State_y'].isnull(), :]
Out[184]:
City State_x State_y
1 San Franciso California NaN
2 Boston Massachusett NaN
