Python Grouping Transpose

I have my data in a pandas dataframe
Out[1]:
     NAME      STORE  AMOUNT
0    GARY        GAP      20
1    GARY        GAP      10
2    GARY     KROGER      15
3  ASHLEY  FOREVER21      30
4  ASHLEY     KROGER      10
5    MARK        GAP      10
6   ROGER     KROGER      30
I'm trying to group by name and sum each person's total amount spent, while also generating a column for each unique store in the dataframe.
Desired:
Out[2]:
     NAME  GAP  KROGER  FOREVER21
0    GARY   30      15          0
1  ASHLEY    0      10         30
2    MARK   10       0          0
3   ROGER    0      30          0
Thanks for your help!

You need pivot_table:
df1 = df.pivot_table(index='NAME',
                     columns='STORE',
                     values='AMOUNT',
                     aggfunc='sum',
                     fill_value=0)
print (df1)
STORE   FOREVER21  GAP  KROGER
NAME
ASHLEY         30    0      10
GARY            0   30      15
MARK            0   10       0
ROGER           0    0      30
An alternative solution is to aggregate with groupby and sum, then reshape with unstack:
df1 = df.groupby(['NAME','STORE'])['AMOUNT'].sum().unstack(fill_value=0)
print (df1)
STORE   FOREVER21  GAP  KROGER
NAME
ASHLEY         30    0      10
GARY            0   30      15
MARK            0   10       0
ROGER           0    0      30
Finally, if you need the index values as a regular column and want to remove the column and index names:
print (df1.reset_index().rename_axis(None, axis=1).rename_axis(None))
     NAME  FOREVER21  GAP  KROGER
0  ASHLEY         30    0      10
1    GARY          0   30      15
2    MARK          0   10       0
3   ROGER          0    0      30
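For completeness, pd.crosstab can produce the same table; this is a sketch equivalent to the pivot_table solution above (crosstab has no fill_value parameter, so the NaNs are filled afterwards):
df1 = pd.crosstab(df['NAME'], df['STORE'],
                  values=df['AMOUNT'], aggfunc='sum').fillna(0).astype(int)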

Related

How do I check if a date has consecutive rows in pandas?

This is my dataframe,
date id name score
2020-10-19 1 Peter 0
2020-10-19 2 Betty 50
2020-10-19 3 Susie 45
2020-10-18 1 Peter 0
2020-10-18 2 Betty 50
2020-10-18 3 Susie 45
2020-10-17 1 Peter 60
2020-10-17 2 Betty 0
2020-10-17 3 Susie 45
How can I check if there was a score of 0 on two consecutive days? The following
table should be returned. (Betty did not have 0 on two consecutive dates)
date id name score
2020-10-19 1 Peter 0
2020-10-18 1 Peter 0
I have tried:
df['score'] = (df.score.diff(1) == 0).astype('int').cumsum()
Note: the datetimes are always sorted in descending order.
If the datetimes are sorted, you can test whether two consecutive values per group are 0:
# a zero here and a zero on the next row for the same id
m1 = df['score'].eq(0) & df.groupby('id')['score'].shift(-1).eq(0)
# a zero here and a zero on the previous row for the same id
m2 = df['score'].eq(0) & df.groupby('id')['score'].shift().eq(0)
df = df[m1 | m2]
print (df)
date id name score
0 2020-10-19 1 Peter 0
3 2020-10-18 1 Peter 0
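For reference, a self-contained version of this approach with the sample data from the question:
import pandas as pd

df = pd.DataFrame({
    'date':  ['2020-10-19'] * 3 + ['2020-10-18'] * 3 + ['2020-10-17'] * 3,
    'id':    [1, 2, 3] * 3,
    'name':  ['Peter', 'Betty', 'Susie'] * 3,
    'score': [0, 50, 45, 0, 50, 45, 60, 0, 45],
})

# a zero that is followed, or preceded, by another zero for the same id
m1 = df['score'].eq(0) & df.groupby('id')['score'].shift(-1).eq(0)
m2 = df['score'].eq(0) & df.groupby('id')['score'].shift().eq(0)
print(df[m1 | m2])  # Peter's rows for 2020-10-19 and 2020-10-18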

Using groupby on a Pandas DataFrame to add arbitrary number of columns and calculate values [duplicate]

This question already has answers here:
How can I pivot a dataframe?
I have a pandas DataFrame that I want to convert into a time table (for visualization purposes) by using groupby and adding an arbitrary number of columns based on one-hour time increments, populating the data from a third column.
The source DataFrame might look like:
ID Hour Floor
Jay 2 34
Jay 3 34
Tim 0 36
Tim 1 34
Tim 2 36
Tom 3 32
Tom 4 36
Rob 3 31
Rob 4 32
Rob 5 33
Rob 6 34
...
What I am aiming for is:
ID HOUR_0 HOUR_1 HOUR_2 HOUR_3 HOUR_4 HOUR_5 HOUR_6...
Jay 0 0 34 34 0 0 0
Tim 36 34 36 0 0 0 0
Tom 0 0 0 32 36 0 0
Rob 0 0 0 31 32 33 34
What I can't get (without manually constructing this using loops) is how to add an arbitrary number of columns (after a groupby operation) based on the unique hours, or the range of hours, in the first DataFrame, and then calculate each column's value from the Hour and Floor columns of the first DataFrame.
Any ideas?
Because I can't help but show how this works with pd.factorize:
import numpy as np
import pandas as pd

# integer codes plus unique labels for the rows (ID) and columns (Hour)
i, r = pd.factorize(df.ID)
j, c = pd.factorize(df.Hour, sort=True)
# build an empty grid, then scatter the Floor values into it
# (assumes each (ID, Hour) pair occurs at most once; a duplicate row would overwrite)
b = np.zeros((r.size, c.size), df.Floor.dtype)
b[i, j] = df.Floor.values
d = pd.DataFrame(b, r, [f'Hour_{h}' for h in c])
d
Hour_0 Hour_1 Hour_2 Hour_3 Hour_4 Hour_5 Hour_6
Jay 0 0 34 34 0 0 0
Tim 36 34 36 0 0 0 0
Tom 0 0 0 32 36 0 0
Rob 0 0 0 31 32 33 34
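The factorize route is essentially a raw NumPy scatter, which is why it is fast, but as the comments above note, it assumes each (ID, Hour) pair occurs at most once. If duplicates are possible, pivot_table lets you choose the aggregation explicitly; a sketch, where aggfunc='last' mimics the scatter's overwrite behaviour and is only one possible choice:
df.pivot_table(index='ID', columns='Hour', values='Floor',
               aggfunc='last', fill_value=0).add_prefix('Hour_')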
Isn't this a simple pivot? (Note that recent pandas versions make DataFrame.pivot's arguments keyword-only, so the positional shortcut below would need to be spelled out as pivot(index='ID', columns='Hour', values='Floor').)
df.pivot(*df.columns).fillna(0).add_prefix('Hour_')
Out[71]:
Hour  Hour_0  Hour_1  Hour_2  Hour_3  Hour_4  Hour_5  Hour_6
ID
Jay      0.0     0.0    34.0    34.0     0.0     0.0     0.0
Rob      0.0     0.0     0.0    31.0    32.0    33.0    34.0
Tim     36.0    34.0    36.0     0.0     0.0     0.0     0.0
Tom      0.0     0.0     0.0    32.0    36.0     0.0     0.0
You are looking for unstack(). But first we need to set_index():
df = df.set_index(['ID','Hour']).unstack(fill_value=0).add_prefix('HOUR_')
df.columns = df.columns.get_level_values(1)
Or using pivot as suggested by Wen:
df = (df.pivot(index='ID', columns='Hour', values='Floor')
        .fillna(0)
        .astype(int)
        .add_prefix('HOUR_'))
Full example:
import pandas as pd
from io import StringIO
data = '''\
ID Hour Floor
Jay 2 34
Jay 3 34
Tim 0 36
Tim 1 34
Tim 2 36
Tom 3 32
Tom 4 36
Rob 3 31
Rob 4 32
Rob 5 33
Rob 6 34'''
# Recreate the dataframe (pd.compat.StringIO was removed in later pandas; use io.StringIO)
df = pd.read_csv(StringIO(data), sep=r'\s+')
# Apply solution
df = df.set_index(['ID','Hour']).unstack(fill_value=0).add_prefix('HOUR_')
df.columns = df.columns.get_level_values(1)
df is now:
     HOUR_0  HOUR_1  HOUR_2  HOUR_3  HOUR_4  HOUR_5  HOUR_6
ID
Jay       0       0      34      34       0       0       0
Rob       0       0       0      31      32      33      34
Tim      36      34      36       0       0       0       0
Tom       0       0       0      32      36       0       0

Get order of subgroups in pandas dataframe

I have a pandas dataframe that looks something like this:
df = pd.DataFrame({'Name' : ['Kate', 'John', 'Peter','Kate', 'John', 'Peter'],'Distance' : [23,16,32,15,31,26], 'Time' : [3,5,2,7,9,4]})
df
Distance Name Time
0 23 Kate 3
1 16 John 5
2 32 Peter 2
3 15 Kate 7
4 31 John 9
5 26 Peter 4
I want to add a column that tells me, for each Name, what's the order of the times.
I want something like this:
Order Distance Name Time
0 16 John 5
1 31 John 9
0 23 Kate 3
1 15 Kate 7
0 32 Peter 2
1 26 Peter 4
I can do it using a for loop:
df2 = df[df['Name'] == 'aaa'].reset_index().reset_index()  # I did this just to create an empty data frame with the columns I want
for name, row in df.groupby('Name').count().iterrows():
    table = df[df['Name'] == name].sort_values('Time').reset_index().reset_index()
    to_concat = [df2, table]
    df2 = pd.concat(to_concat)
df2.drop('index', axis=1, inplace=True)
df2.columns = ['Order', 'Distance', 'Name', 'Time']
df2
This works, but the problem is that (apart from being very unpythonic) it takes about half an hour to run on large tables; my actual table has about 50 thousand rows.
Can someone help me write this in a simpler way that runs faster?
I'm sorry if this has been answered somewhere, but I didn't really know how to search for it.
Use sort_values with cumcount:
df = df.sort_values(['Name','Time'])
df['Order'] = df.groupby('Name').cumcount()
print (df)
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
If you need Order as the first column, use insert:
df = df.sort_values(['Name','Time'])
df.insert(0, 'Order', df.groupby('Name').cumcount())
print (df)
Order Distance Name Time
1 0 16 John 5
4 1 31 John 9
0 0 23 Kate 3
3 1 15 Kate 7
2 0 32 Peter 2
5 1 26 Peter 4
In [67]: df = df.sort_values(['Name','Time']) \
.assign(Order=df.groupby('Name').cumcount())
In [68]: df
Out[68]:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
PS I'm not sure this is the most elegant way to do this...
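One caveat with the In [67] version: inside .assign, df still refers to the unsorted frame, so the cumcount is computed on the original row order and only lines up here because each name's rows already happen to appear in time order. Passing a lambda instead makes the cumcount run on the sorted frame:
df = (df.sort_values(['Name', 'Time'])
        .assign(Order=lambda d: d.groupby('Name').cumcount()))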

Compare two pandas dataframe with different size

I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second one, smaller like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or NumPy way to do it.
Many thanks,
Boris
If you want to bring values across only for the rows that match between the two dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
     Name  Age Special ability
0    Sara    4   Walk on water
1  Gustaf   12             NaN
2  Patrik   11             NaN
This can also be done with more than one matching key. (In this example, Patrik from df1 does not match any row of df2 because the ages differ, and therefore will not merge; Sara's age is chosen to match her age in df2.)
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[4,83]})
df1
     Name Special ability  Age
0    Sara   Walk on water    4
1  Patrik       FireBalls   83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on=['Name','Age'], right_on=['Name','Age'], how='left')
df
     Name  Age Special ability
0    Sara    4   Walk on water
1  Gustaf   12             NaN
2  Patrik   11             NaN
You probably want to use a merge:
df = df1.merge(df2, left_on="A", right_on="G")
This gives you a dataframe with four columns, because both key columns are kept. Dropping the extra key and renaming the value column then gives you the column names you want:
df = df.drop(columns='G').rename(columns={'H': 'C'})
You can use map with a Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
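Two caveats with the map approach: the lookup Series built by set_index('G') must have unique values in G (mapping through a Series with a duplicated index raises an error), and any value of A with no match in G comes back as NaN. If unmatched keys are possible, chain a fillna with whatever default makes sense; the 0 below is just an example:
df1['C'] = df1['A'].map(df2.set_index('G')['H']).fillna(0)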
Or merge with drop and rename:
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
         .drop('G', axis=1)
         .rename(columns={'H':'C'}))
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed more simply with df2.G.searchsorted(df1.A), but that is unlikely to be any more efficient, because working on the underlying arrays with .values, as done above, avoids pandas overhead.
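One caveat: searchsorted only returns correct positions if df2.G is sorted in ascending order and every value of df1.A actually occurs in df2.G; otherwise it silently yields insertion points rather than matches. When those assumptions aren't guaranteed, a cheap sanity check helps:
import numpy as np

idx = np.searchsorted(df2.G.values, df1.A.values)
# keys larger than everything in df2.G would give idx == len(df2); clip before indexing
idx = np.clip(idx, 0, len(df2) - 1)
# verify every position is a real match, not just an insertion point
assert (df2.G.values[idx] == df1.A.values).all(), "df1.A contains keys missing from df2.G"
df1['C'] = df2.H.values[idx]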

Why am I not able to drop values within columns on pandas using python3?

I have a DataFrame (df) with various columns. In this assignment I have to find the difference between summer gold medals and winter gold medals, relative to total medals, for each country, using stats about the Olympics.
I must only include countries which have at least one gold medal. I am trying to use dropna() to exclude the countries that do not have at least one gold medal. My current code:
def answer_three():
    df['medal_count'] = df['Gold'] - df['Gold.1']
    df['medal_count'].dropna()
    df['medal_dif'] = df['medal_count'] / df['Gold.2']
    df['medal_dif'].dropna()
    return df.head()

print (answer_three())
This results in the following output:
# Summer Gold Silver Bronze Total # Winter Gold.1 \
Afghanistan 13 0 0 2 2 0 0
Algeria 12 5 2 8 15 3 0
Argentina 23 18 24 28 70 18 0
Armenia 5 1 2 9 12 6 0
Australasia 2 3 4 5 12 0 0
Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 \
Afghanistan 0 0 0 13 0 0 2
Algeria 0 0 0 15 5 2 8
Argentina 0 0 0 41 18 24 28
Armenia 0 0 0 11 1 2 9
Australasia 0 0 0 2 3 4 5
Combined total ID medal_count medal_dif
Afghanistan 2 AFG 0 NaN
Algeria 15 ALG 5 1.0
Argentina 70 ARG 18 1.0
Armenia 12 ARM 1 1.0
Australasia 12 ANZ 3 1.0
I need to get rid of both the '0' values in "medal_count" and the NaN in "medal_dif".
I am also aware the maths/way I have written the code is probably incorrect to solve the question, but I think I need to start by dropping these values? Any help with any of the above is greatly appreciated.
dropna takes an axis argument: axis=0 drops rows and axis=1 drops columns, with axis=0 being the default, so to drop along a whole column you would pass axis=1.
Note also that dropna returns a new object rather than modifying the original in place, so df['medal_count'].dropna() on its own line does nothing to df; you would need to assign the result back, or filter the frame instead.
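A sketch of how that filtering could look, assuming (as the printed output suggests) that 'Gold' is summer golds, 'Gold.1' winter golds, and 'Gold.2' the combined gold count; the column names come from the question, and the rest is one possible reading of the assignment:
def answer_three():
    # keep only countries with at least one gold medal of either kind
    gold = df[(df['Gold'] > 0) | (df['Gold.1'] > 0)]
    # difference between summer and winter golds, relative to combined golds
    return (gold['Gold'] - gold['Gold.1']) / gold['Gold.2']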
