Suppose we take a pandas dataframe...
name age family
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
Then do a groupby() followed by an aggregate/summarize operation (in my example, my function name_join aggregates the names):
def name_join(list_names, concat='-'):
    return concat.join(list_names)

group_df = df.groupby('family')
group_df = group_df.aggregate({'name': name_join, 'age': 'mean'})
The grouped summarized output is thus:
age name
family
1 23 john-jason-jane
2 28 jack-james
Question:
Is there a quick, efficient way to get to the following from the aggregated table?
name age family
0 john 23 1
1 jason 23 1
2 jane 23 1
3 jack 28 2
4 james 28 2
(Note: the age column values are just examples; I don't mind the information I am losing to the averaging in this specific example)
The way I thought I could do it does not look too efficient:
- create an empty dataframe
- for every line in group_df, separate the names
- return a dataframe with as many rows as there are names in the starting row
- append the output to the empty dataframe
The rough equivalent is .reset_index(), but it may not be helpful to think of it as the "opposite" of groupby().
You are splitting a string into pieces while maintaining each piece's association with 'family'. This old answer of mine does the job: just set 'family' as the index column first, follow the recipe in the link above, and then reset_index() at the end to get your desired result.
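A minimal sketch of that recipe, assuming group_df is the aggregated table above with 'family' as its index (explode() is a newer shortcut than the linked answer, but the idea is the same):
expanded = (group_df['name']
            .str.split('-')         # undo name_join: back to lists of names
            .explode()              # one row per name, repeating the family index
            .to_frame('name')
            .join(group_df['age'])  # re-attach the averaged age per family
            .reset_index())         # turn 'family' back into a column
print(expanded)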
It turns out that DataFrame.groupby() returns an object with the original data stored in its obj attribute, so ungrouping is just a matter of pulling out that original data.
group_df = df.groupby('family')
group_df.obj
Example
>>> dat_1 = df.groupby("category_2")
>>> dat_1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fce78b3dd00>
>>> dat_1.obj
order_date category_2 value
1 2011-02-01 Cross Country Race 324400.0
2 2011-03-01 Cross Country Race 142000.0
3 2011-04-01 Cross Country Race 498580.0
4 2011-05-01 Cross Country Race 220310.0
5 2011-06-01 Cross Country Race 364420.0
.. ... ... ...
535 2015-08-01 Triathalon 39200.0
536 2015-09-01 Triathalon 75600.0
537 2015-10-01 Triathalon 58600.0
538 2015-11-01 Triathalon 70050.0
539 2015-12-01 Triathalon 38600.0
[531 rows x 3 columns]
Here's a complete example that recovers the original dataframe from the grouped object:
import pandas

def name_join(list_names, concat='-'):
    return concat.join(list_names)

print('create dataframe\n')
df = pandas.DataFrame({'name': ['john', 'jason', 'jane', 'jack', 'james'], 'age': [1, 36, 32, 26, 30], 'family': [1, 1, 1, 2, 2]})
df.index.name = 'indexer'
print(df)
print('create group_by object')
group_obj_df = df.groupby('family')
print(group_obj_df)
print('\nrecover grouped df')
group_joined_df = group_obj_df.aggregate({'name': name_join, 'age': 'mean'})
group_joined_df
create dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
create group_by object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbfdd9dd048>
recover grouped df
name age
family
1 john-jason-jane 23
2 jack-james 28
print('\nRecover the original dataframe')
print(pandas.concat([group_obj_df.get_group(key) for key in group_obj_df.groups]))
Recover the original dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
There are a few ways to undo DataFrame.groupby; one is DataFrame.groupby(...).filter(lambda x: True), which keeps every group and so returns the original DataFrame.
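For instance, assuming the df built in the example above:
restored = df.groupby('family').filter(lambda x: True)  # an always-True predicate keeps every group
assert restored.equals(df)  # rows and index come back unchanged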
Here is a simplified version of my dataframe (the number of persons in my dataframe is way more than 3):
import pandas as pd

df = pd.DataFrame({'Person': ['John', 'David', 'Mary', 'John', 'David', 'Mary'],
                   'Sales': [10, 15, 20, 11, 12, 18],
                   })
Person Sales
0 John 10
1 David 15
2 Mary 20
3 John 11
4 David 12
5 Mary 18
I would like to add a column "Total" to this data frame, holding the total sales per person:
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
What would be the easiest way to achieve this?
I have tried
df.groupby('Person').sum()
but the shape of the output is not congruent with the shape of df.
Sales
Person
David 27
John 21
Mary 38
What you want is the transform method, which applies a function to each group and returns a result aligned with the original dataframe:
df['Total'] = df.groupby('Person')['Sales'].transform('sum')
It gives as expected:
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
The easiest way to achieve this with the pandas groupby and sum functions is to map the per-person totals back onto the 'Person' column (assigning the grouped sum directly would not align, since its index is Person rather than the row numbers):
df['Total'] = df['Person'].map(df.groupby('Person')['Sales'].sum())
This will add a column to the dataframe with the total sales per person.
Because your 'Person' column contains repeated values, the result of groupby().sum() is indexed by name and cannot be assigned directly as a new column. I would suggest making a new dataframe based on the sales sum; the code below will help you with that (see the merge sketch after it for getting the totals back onto the original rows):
newDf = df.groupby('Person')['Sales'].sum().reset_index()
This will create a new dataframe with 'Person' and 'Sales' as columns.
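A minimal sketch, reusing the newDf built above, for attaching those totals back to the original rows:
df = df.merge(newDf.rename(columns={'Sales': 'Total'}), on='Person', how='left')  # one Total per original row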
I have the following dataframe:
Name rollNumber external_roll_number testDate marks
0 John 34 234 2021-04-28 15
1 John 34 234 2021-03-28 25
I would like to convert it like this:
Name rollNumber external_roll_number testMonth marks testMonth marks
0 John 34 234 April 15 March 25
If the above is not possible then I would at least want it to be like this:
Name rollNumber external_roll_number testDate marks testDate marks
0 John 34 234 2021-04-28 15 2021-03-28 25
How can I convert my dataframe to the desired output? This change will be based on the Name column of the rows.
EDIT 1
I tried using pivot_table like this but I did not get the desired result.
merged_df_pivot = pd.pivot_table(merged_df, index=["name", "testDate"], aggfunc="first", dropna=False).fillna("")
When I try to iterate through the merged_df_pivot like this:
for index, details in merged_df_pivot.iterrows():
I am again getting two rows, and I was also not able to add the new testMonth column by the above method.
- the core is unstack(), which turns the month values into columns
- the detail is then restructuring the month-by-month marks columns into the required structure
- it is generally considered bad practice to have duplicate column names, hence I have suffixed them
import io
import pandas as pd

df = pd.read_csv(io.StringIO(""" Name rollNumber external_roll_number testDate marks
0 John 34 234 2021-04-28 15
1 John 34 234 2021-03-28 25
"""), sep=r"\s+")
df["testDate"] = pd.to_datetime(df["testDate"])
df = df.assign(testMonth = df["testDate"].dt.strftime("%B")).drop(columns="testDate")
dft = (df.set_index([c for c in df.columns if c!="marks"])
.unstack("testMonth") # make month a column
.droplevel(0, axis=1) # remove unneeded level in columns
# create columns for months from column names and rename marks columns
.pipe(lambda d: d.assign(**{f"testMonth_{i+1}":c
for i,c in enumerate(d.columns)}).rename(columns={c:f"marks_{i+1}"
for i,c in enumerate(d.columns)}))
.reset_index()
)
output

   Name  rollNumber  external_roll_number  marks_1  marks_2 testMonth_1 testMonth_2
0  John          34                   234       15       25       April       March
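As an aside, a shorter sketch for the same wide layout, assuming the df built above (after testMonth was derived): number each person's tests with cumcount() and pivot wide on that number.
df["n"] = df.groupby("Name").cumcount() + 1
# list-valued index/values need pandas >= 1.1
wide = df.pivot(index=["Name", "rollNumber", "external_roll_number"],
                columns="n", values=["testMonth", "marks"])
wide.columns = [f"{c}_{i}" for c, i in wide.columns]  # flatten ('marks', 1) -> 'marks_1'
wide = wide.reset_index()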
As part of my ongoing quest to get my head around pandas, I am confronted by a surprising Series. I don't understand how and why the output is a Series when I was expecting a dataframe. If someone could explain what is happening here, it would be much appreciated.
ta, Andrew
Some data:
hash email date subject subject_length
0 65319af6e jbrockmendel#gmail.com 2020-11-28 REF-IntervalIndex._assert_can_do_setop-38112 44
1 0bf58d8a9 simonjayhawkins#gmail.com 2020-11-28 DOC-add-contibutors-to-1.2.0-release-notes-38132 48
2 d16df293c 45562402+rhshadrach#users.noreply.github.com 2020-11-28 TYP-Add-cast-to-ABC-Index-like-types-38043 42
...
Some Code:
def my_function(row):
    output = row['email'].value_counts().sort_values(ascending=False).head(3)
    return output
top_three = dataframe.groupby(pd.Grouper(key='date', freq='1M')).apply(my_function)
Some Output:
date
2020-01-31 jbrockmendel#gmail.com 159
50263213+MomIsBestFriend#users.noreply.github.com 44
TomAugspurger#users.noreply.github.com 41
...
2020-10-31 jbrockmendel#gmail.com 170
2658661+dsaxton#users.noreply.github.com 23
61934744+phofl#users.noreply.github.com 21
2020-11-30 jbrockmendel#gmail.com 134
61934744+phofl#users.noreply.github.com 36
41443370+ivanovmg#users.noreply.github.com 19
Name: email, dtype: int64
It depends on what your groupby is returning. In your case, you are applying a function to row['email'] and returning a single value_counts, while all the other columns in your data become part of the index. In other words, the groupby returns a multi-indexed, single-column output, which comes back as a Series instead of a DataFrame. A reset_index() would therefore give you what you need.
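A minimal sketch of that fix, assuming top_three is the Series from the question (the inner index level's name depends on your pandas version):
top_three_df = top_three.rename('count').reset_index()  # group keys become columns again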
For more clarity on which data structure is returned, we can do a toy experiment.
For example, in the first case the apply function is applying the lambda function on groups, where each group contains a dataframe (check [i for i in df.groupby(['a'])] to see what each group contains).
df = pd.DataFrame({'a':[1,1,2,2,3], 'b':[4,5,6,7,8]})
print(df.groupby(['a']).apply(lambda x:x**2))
#dataframe
a b
0 1 16
1 1 25
2 4 36
3 4 49
4 9 64
For the second case, we are applying the lambda function on a Series object, so only a single Series is being returned. In this case, it doesn't return a dataframe and instead returns a series.
print(df.groupby(['a'])['b'].apply(lambda x:x**2))
#series
0 16
1 25
2 36
3 49
4 64
Name: b, dtype: int64
This can be solved simply by selecting with a list of columns, which keeps the DataFrame structure:
print(df.groupby(['a'])[['b']].apply(lambda x:x**2))
#dataframe
b
0 16
1 25
2 36
3 49
4 64
I have a dataframe df which looks like this:
CustomerId Age
1 25
2 18
3 45
4 57
5 34
I have a list called "Price" which looks like this:
Price = [123,345,1212,11,677]
I want to add that list to the dataframe. Here is my code:
df['Price'] = Price
It seems to work, but when I print the dataframe the field called "Price" contains all the metadata information such as Name, Type... as well as the values of the Price list.
How can I create a column called "Price" containing only the values of the Price list so that the dataframe looks like:
CustomerId Age Price
1 25 123
2 18 345
3 45 1212
4 57 11
5 34 677
In my opinion, the most elegant solution is to use assign:
df.assign(Price=Price)
CustomerId Age Price
1 25 123
2 18 345
3 45 1212
4 57 11
5 34 677
Note that assign actually returns a new DataFrame rather than modifying df in place, so assign the result back if you want to keep it (df = df.assign(Price=Price)).
assign creates a new column 'Price' (the left-hand Price, a keyword argument) with the content of the list Price (the right-hand Price).
import pandas as pd
df['Price'] = pd.Series(Price)
If you use this, you will not get an error when the series has fewer values than the dataframe has rows; the missing positions are simply filled with NaN, whereas assigning a plain list that is too short raises an error telling you the lengths do not match.
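A small self-contained sketch of that difference, with hypothetical data:
import pandas as pd

df = pd.DataFrame({'CustomerId': [1, 2, 3], 'Age': [25, 18, 45]})
df['Price'] = pd.Series([123, 345])  # short Series: row 2 just becomes NaN
# df['Price'] = [123, 345]           # short list: raises ValueError (length mismatch)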
I copy-pasted your example into a dataframe using pandas.read_clipboard and then added the column like this:
import pandas as pd
df = pd.read_clipboard()
Price = [123,345,1212,11,677]
df.loc[:,'Price'] = Price
df
Generating this:
CustomerId Age Price
0 1 25 123
1 2 18 345
2 3 45 1212
3 4 57 11
4 5 34 677
I'm a recent convert from Excel to Python. I think that what I'm trying to do here would traditionally be done with a VLOOKUP of sorts, but I might be struggling with the terminology and not finding the Python solution. I have been using the pandas library for most of my data analysis framework.
I have two different data frames: one with the weight changes (DF1), and the other with the weights (DF2). I want to go line by line (changes are chronological) and:
- create a new column in DF1 with the weight before the change (basically extracted from DF2)
- update the results in DF2 where Weight = Weight + WeightChange
Note: the data frames do not have the same dimensions; an individual has several weight changes (DF1) but only one weight (DF2).
DF1:
Name WeightChange
1 John 5
2 Peter 10
3 John 7
4 Mary -20
5 Gary -3
DF2:
Name Weight
1 John 180
2 Peter 160
3 Mary 120
4 Gary 150
Firstly I'd merge df1 and df2 on the 'Name' column to add the weight column to df1.
Then I'd group df1 by name and apply a transform to calculate the total weight change for each person. transform returns a Series aligned to the original df, so you can add the aggregated result back as a column.
Then I'd merge this column into df2, and then it's a simple case of adding the total weight change to the existing weight column:
In [242]:
df1 = df1.merge(df2, on='Name', how='left')
df1['WeightChangeTotal'] = df1.groupby('Name')['WeightChange'].transform('sum')
df1
Out[242]:
Name WeightChange Weight WeightChangeTotal
0 John 5 180 12
1 Peter 10 160 10
2 John 7 180 12
3 Mary -20 120 -20
4 Gary -3 150 -3
In [243]:
df2 = df2.merge(df1[['Name','WeightChangeTotal']], on='Name')
df2
Out[243]:
Name Weight WeightChangeTotal
0 John 180 12
1 John 180 12
2 Peter 160 10
3 Mary 120 -20
4 Gary 150 -3
In [244]:
df2['Weight'] = df2['Weight'] + df2['WeightChangeTotal']
df2
Out[244]:
Name Weight WeightChangeTotal
0 John 192 12
1 John 192 12
2 Peter 170 10
3 Mary 100 -20
4 Gary 147 -3
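Note that because df1 has two John rows, the merge above leaves a duplicate John row in df2. A sketch of deduplicating before merging, assuming the frames above:
totals = df1[['Name', 'WeightChangeTotal']].drop_duplicates()  # one total per person
df2 = df2.merge(totals, on='Name')  # df2 keeps one row per Name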
EDIT
To address your desired behaviour for the 'WeightBefore' column:
In [267]:
df1['WeightBefore'] = df1['Weight'] + df1.groupby('Name')['WeightChange'].transform(lambda x: x.shift().cumsum()).fillna(0)
df1
Out[267]:
Name WeightChange Weight WeightBefore
0 John 5 180 180
1 Peter 10 160 160
2 John 7 180 185
3 Mary -20 120 120
4 Gary -3 150 150
So the above groups on 'Name', shifts each person's changes down one row, and takes their cumulative sum, so each row sees only the changes that happened before it. We have to call fillna because the shift produces NaN on the first weight change for each Name.
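To see the mechanics with more than two changes for one person, a tiny sketch with hypothetical numbers:
import pandas as pd

changes = pd.Series([5, 7, 3])  # John starts at 180 and changes weight three times
before = 180 + changes.shift().cumsum().fillna(0)
print(before.tolist())  # [180.0, 185.0, 192.0]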