Pandas pivot table subtotals with multi-index - python

I'm trying to create a simple pivot table with subtotals, Excel-style; however, I can't find a method for this in pandas. I've tried the solution Wes suggested in another subtotal-related question, but that doesn't give the expected results. Below are the steps to reproduce it:
Create the sample data:
import numpy as np
import pandas as pd

sample_data = {'customer': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B'],
               'product': ['astro', 'ball', 'car', 'astro', 'ball', 'car', 'astro', 'ball', 'car', 'astro', 'ball', 'car'],
               'week': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
               'qty': [10, 15, 20, 40, 20, 34, 300, 20, 304, 23, 45, 23]}
df = pd.DataFrame(sample_data)
Create the pivot table with margins (it only has the grand total, not subtotals by customer (A, B)):
piv = df.pivot_table(index=['customer', 'product'], columns='week',
                     values='qty', margins=True, aggfunc=np.sum)
week 1 2 All
customer product
A astro 10 300 310
ball 15 20 35
car 20 304 324
B astro 40 23 63
ball 20 45 65
car 34 23 57
All 139 715 854
Then I tried the method Wes McKinney mentioned in another thread, using the stack function:
piv2 = df.pivot_table(index='customer',columns=['week','product'],values='qty',margins=True,aggfunc=np.sum)
piv2.stack('product')
The result has the format I want, but the rows with "All" don't have the sums:
week 1 2 All
customer product
A NaN NaN 669.0
astro 10.0 300.0 NaN
ball 15.0 20.0 NaN
car 20.0 304.0 NaN
B NaN NaN 185.0
astro 40.0 23.0 NaN
ball 20.0 45.0 NaN
car 34.0 23.0 NaN
All NaN NaN 854.0
astro 50.0 323.0 NaN
ball 35.0 65.0 NaN
car 54.0 327.0 NaN
How can I make it work as it would in Excel (sample below), with all the subtotals and totals filled in? What am I missing?
[Excel sample image]
Just to note, I am able to make it work using for loops, filtering by customer on each iteration and concatenating afterwards (a sketch of that workaround is below), but I hope there might be a more direct solution. Thank you.
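For reference, the loop-based workaround might look roughly like this (a sketch with my own naming, not the asker's actual code):
pieces = []
for cust, grp in df.groupby('customer'):
    p = grp.pivot_table(index=['customer', 'product'], columns='week',
                        values='qty', aggfunc=np.sum)
    # build a one-row subtotal frame and give it a matching MultiIndex
    subtotal = p.sum().to_frame().T
    subtotal.index = pd.MultiIndex.from_tuples([(cust, 'total')],
                                               names=['customer', 'product'])
    pieces.append(pd.concat([p, subtotal]))
result = pd.concat(pieces)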

You can do it in one step, but you have to be strategic about the margins name because of alphabetical sorting:
piv = df.pivot_table(index=['customer', 'product'],
                     columns='week',
                     values='qty',
                     margins=True,
                     margins_name='Total',
                     aggfunc=np.sum)

(pd.concat([piv,
            piv.query('customer != "Total"')
               .sum(level=0)
               .assign(product='total')
               .set_index('product', append=True)])
   .sort_index())
Output:
week 1 2 Total
customer product
A astro 10 300 310
ball 15 20 35
car 20 304 324
total 45 624 669
B astro 40 23 63
ball 20 45 65
car 34 23 57
total 94 91 185
Total 139 715 854
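Since DataFrame.sum(level=...) was deprecated in pandas 1.3 and removed in 2.0, on current pandas the subtotal step needs an explicit groupby; a minimal equivalent sketch:
subtotals = (piv.query('customer != "Total"')
                .groupby(level=0).sum()      # replaces .sum(level=0)
                .assign(product='total')
                .set_index('product', append=True))
pd.concat([piv, subtotals]).sort_index()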

Scott Boston's answer is perfect and elegant. For reference, here is what you get if you instead group by customer alone and pd.concat() the two pivots:
piv = df.pivot_table(index=['customer', 'product'], columns='week',
                     values='qty', margins=True, aggfunc=np.sum)
piv3 = df.pivot_table(index=['customer'], columns='week',
                      values='qty', margins=True, aggfunc=np.sum)
piv4 = pd.concat([piv, piv3], axis=0)
piv4
week 1 2 All
(A, astro) 10 300 310
(A, ball) 15 20 35
(A, car) 20 304 324
(B, astro) 40 23 63
(B, ball) 20 45 65
(B, car) 34 23 57
(All, ) 139 715 854
A 45 624 669
B 94 91 185
All 139 715 854
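The mixed index (tuples alongside scalars) appears because piv has a MultiIndex while piv3 has a flat one. If you want the subtotal rows interleaved under each customer, one possible refinement (a sketch; as in the accepted answer, the margins name still matters for where the grand-total row sorts) is to lift piv3's index into a matching MultiIndex before concatenating:
piv3m = piv3.drop('All')  # keep only the per-customer rows
piv3m.index = pd.MultiIndex.from_product([piv3m.index, ['total']],
                                         names=['customer', 'product'])
pd.concat([piv, piv3m]).sort_index()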

Related

How to add more rows of random values to an existing column in my dataset - pandas?

I want to add ten more rows to each column of the dataset provided below. It should add random integer values ranging from:
20-27 for temperature
40-55 for humidity
150-170 for moisture
Dataset:
Temperature Humidity Moisture
0 22 46 0
1 36 41.4 170
2 18 69.3 120
3 21 39.3 200
4 39 70 150
5 22 78 220
6 27 65 180
7 32 75 250
I have tried:
import numpy as np
import pandas as pd
data1 = np.random.randint(20, 27, size=10)
df = pd.DataFrame(data1, columns=['Temperature'])
print(df)
This method deletes all the existing row values and gives out only the random values. What I need is to keep the existing rows and append the random values to them.
Use:
df1 = pd.DataFrame({'Temperature': np.random.randint(20, 28, size=10),
                    'Humidity': np.random.randint(40, 56, size=10),
                    'Moisture': np.random.randint(150, 171, size=10)})
# randint's upper bound is exclusive, so 28, 56, 171 give 20-27, 40-55, 150-170

df = pd.concat([df, df1], ignore_index=True)
print(df)
Temperature Humidity Moisture
0 22 46.0 0
1 36 41.4 170
2 18 69.3 120
3 21 39.3 200
4 39 70.0 150
5 22 78.0 220
6 27 65.0 180
7 32 75.0 250
8 20 52.0 158
9 21 45.0 156
10 23 49.0 151
11 24 51.0 167
12 22 45.0 157
13 21 43.0 163
14 26 55.0 162
15 25 40.0 164
16 24 40.0 155
17 20 48.0 150
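Since the new rows are random, your numbers will differ from the output above on each run. If you need reproducible values, you can seed a generator first; a small sketch using NumPy's newer Generator API:
rng = np.random.default_rng(42)  # fixed seed -> same rows every run
df1 = pd.DataFrame({'Temperature': rng.integers(20, 28, size=10),
                    'Humidity': rng.integers(40, 56, size=10),
                    'Moisture': rng.integers(150, 171, size=10)})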

Pandas drop consecutive duplicate rows only, ignoring specific columns

I have the dataframe below:
df = pd.DataFrame({
    'ID': ['James', 'James', 'James', 'James',
           'Max', 'Max', 'Max', 'Max', 'Max',
           'Park', 'Park', 'Park', 'Park',
           'Tom', 'Tom', 'Tom', 'Tom'],
    'From_num': [578, 420, 420, 'Started', 298, 78, 36, 298, 'Started', 28, 28, 311, 'Started', 60, 520, 99, 'Started'],
    'To_num': [96, 578, 578, 420, 36, 298, 78, 36, 298, 112, 112, 28, 311, 150, 60, 520, 99],
    'Date': ['2020-05-12', '2020-02-02', '2020-02-01', '2019-06-18',
             '2019-08-26', '2019-06-20', '2019-01-30', '2018-10-23',
             '2018-08-29', '2020-05-21', '2020-05-20', '2019-11-22',
             '2019-04-12', '2019-10-16', '2019-08-26', '2018-12-11', '2018-10-09']})
and it is like this:
ID From_num To_num Date
0 James 578 96 2020-05-12
1 James 420 578 2020-02-02
2 James 420 578 2020-02-01 # Drop this duplicated row (ignore Date)
3 James Started 420 2019-06-18
4 Max 298 36 2019-08-26
5 Max 78 298 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park 28 112 2020-05-21
10 Park 28 112 2020-05-20 # Drop this duplicate row (ignore date)
11 Park 311 28 2019-11-22
12 Park Started 311 2019-04-12
13 Tom 60 150 2019-10-16
14 Tom 520 60 2019-08-26
15 Tom 99 520 2018-12-11
16 Tom Started 99 2018-10-09
There are some consecutive duplicated values (ignoring the Date value) within each 'ID' (name); e.g. lines 1 and 2 for James both have From_num 420, and the same goes for lines 9 and 10. I wish to drop the 2nd duplicated row and keep the first. I wrote loop conditions, but they are very redundant and slow; I assume there must be an easier way to do this, so please help if you have ideas. Many thanks. The expected result is like this:
ID From_num To_num Date
0 James 578 96 2020-05-12
1 James 420 578 2020-02-02
2 James Started 420 2019-06-18
3 Max 298 36 2019-08-26
4 Max 78 298 2019-06-20
5 Max 36 78 2019-01-30
6 Max 298 36 2018-10-23
7 Max Started 298 2018-08-29
8 Park 28 112 2020-05-21
9 Park 311 28 2019-11-22
10 Park Started 311 2019-04-12
11 Tom 60 150 2019-10-16
12 Tom 520 60 2019-08-26
13 Tom 99 520 2018-12-11
14 Tom Started 99 2018-10-09
It's a bit late, but does this do what you wanted? This drops consecutive duplicates ignoring "Date".
t = df[['ID', 'From_num', 'To_num']]
df[(t.ne(t.shift())).any(axis=1)]
ID From_num To_num Date
0 James 578 96 2020-05-12
1 James 420 578 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-08-26
5 Max 78 298 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park 28 112 2020-05-21
11 Park 311 28 2019-11-22
12 Park Started 311 2019-04-12
13 Tom 60 150 2019-10-16
14 Tom 520 60 2019-08-26
15 Tom 99 520 2018-12-11
16 Tom Started 99 2018-10-09
This drops rows with index values 2 and 10.
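The same idea generalizes to any subset of columns; wrapped in a small helper (the function name here is my own, just for illustration):
def drop_consecutive_duplicates(frame, subset):
    # keep the first row of each run of consecutive duplicates in `subset`
    t = frame[subset]
    return frame[t.ne(t.shift()).any(axis=1)]

result = drop_consecutive_duplicates(df, ['ID', 'From_num', 'To_num']).reset_index(drop=True)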
Compare each row with the one above it, then invert the boolean mask to get your result:
cond1 = df.ID.eq(df.ID.shift())
cond2 = df.From_num.eq(df.From_num.shift())
# also compare To_num, so rows are only dropped when the whole
# (ID, From_num, To_num) triple repeats
cond3 = df.To_num.eq(df.To_num.shift())
cond = cond1 & cond2 & cond3
df.loc[~cond].reset_index(drop=True)
Alternative: a longer route:
(
    df.assign(
        temp=df.groupby(["ID", "From_num"]).From_num.transform("size"),
        check=lambda x: (x.From_num.eq(x.From_num.shift())) &
                        (x.temp.eq(x.temp.shift())),
    )
    .query("check == 0")
    .drop(["temp", "check"], axis=1)
)
It seems to me that's exactly what DataFrame.drop_duplicates does; by default it keeps the first occurrence and drops the rest:
unique_df = df.drop_duplicates(['ID', 'From_num', 'To_num'])
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
EDIT
As mentioned in the question, only consecutive rows should be processed. To do so, I propose to flag them first, then run drop_duplicates on a subset that includes the flag (I'm not sure it's the best solution):
# start by letting each row point to itself
df['original_index'] = df.index
indices = df.index
for i in range(1, len(indices)):
    prev, curr = indices[i - 1], indices[i]
    # if the current row equals the previous one (ignoring Date)
    if (df.loc[prev, 'ID'] == df.loc[curr, 'ID']
            and df.loc[prev, 'From_num'] == df.loc[curr, 'From_num']
            and df.loc[prev, 'To_num'] == df.loc[curr, 'To_num']):
        # inherit the previous row's flag, so a whole run of consecutive
        # duplicates shares one original_index value
        df.loc[curr, 'original_index'] = df.loc[prev, 'original_index']
Now we add the 'original_index' column to the drop_duplicates subset:
unique_df = df.drop_duplicates(['ID', 'From_num', 'To_num', 'original_index'])
df.groupby(['ID', 'From_num', 'To_num']).first().reset_index()
Edit: this will remove duplicates even if they are not consecutive, e.g. rows 4 and 7 in the original df.
Update:
cols = ['ID', 'From_num', 'To_num']
df.loc[(df[cols].shift() != df[cols]).any(axis=1)].shape
(.shape here just reports the resulting row count; drop it to get the filtered frame itself.)

sum values in column grouped by another column pandas

My df looks like this:
country id x y
AT 11 50 100
AT 12 NaN 90
AT 13 NaN 104
AT 22 40 50
AT 23 30 23
AT 61 40 88
AT 62 NaN 78
UK 11 40 34
UK 12 NaN 22
UK 13 NaN 70
What I need is the sum of the y column accumulated into the first row whose x is not NaN, grouping rows by the first digit on the left of the id column, separately for each country. At the end I just need to drop the NaN rows.
The result should be something like this:
country id x y
AT 11 50 294
AT 22 40 50
AT 23 30 23
AT 61 40 166
UK 11 40 126
You can aggregate with GroupBy.agg using the first and sum functions, together with a helper Series: flag the non-missing values with Series.notna and take the cumulative sum with Series.cumsum to label each run:
df1 = (df.groupby(['country', df['x'].notna().cumsum()])
         .agg({'id': 'first', 'x': 'first', 'y': 'sum'})
         .reset_index(level=1, drop=True)
         .reset_index())
print (df1)
country id x y
0 AT 11 50.0 294
1 AT 22 40.0 50
2 AT 23 30.0 23
3 AT 61 40.0 166
4 UK 11 40.0 126
If it's possible that the first value(s) of x are missing, add DataFrame.dropna:
print (df)
country id x y
0 AT 11 NaN 100
1 AT 11 50.0 100
2 AT 12 NaN 90
3 AT 13 NaN 104
4 AT 22 40.0 50
5 AT 23 30.0 23
6 AT 61 40.0 88
7 AT 62 NaN 78
8 UK 11 40.0 34
9 UK 12 NaN 22
10 UK 13 NaN 70
df1 = (df.groupby(['country', df['x'].notna().cumsum()])
         .agg({'id': 'first', 'x': 'first', 'y': 'sum'})
         .reset_index(level=1, drop=True)
         .reset_index()
         .dropna(subset=['x']))
print (df1)
country id x y
1 AT 11 50.0 294
2 AT 22 40.0 50
3 AT 23 30.0 23
4 AT 61 40.0 166
5 UK 11 40.0 126
Use groupby, transform and dropna:
print (df.assign(y=df.groupby(df["x"].notnull().cumsum())["y"].transform('sum'))
.dropna(subset=["x"]))
country id x y
0 AT 11 50.0 294
3 AT 22 40.0 50
4 AT 23 30.0 23
5 AT 61 40.0 166
7 UK 11 40.0 126
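One caveat with this last version: the groupby key is only the cumulative sum of non-missing x, so it relies on a new non-NaN run starting at each country boundary (true in this sample). A slightly more defensive sketch includes country in the key as well:
out = (df.assign(y=df.groupby(['country', df['x'].notna().cumsum()])['y']
                   .transform('sum'))
         .dropna(subset=['x']))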

How do I reshape this DataFrame in Python?

I have a DataFrame df_sale in Python that I want to reshape, summing across the price column and adding a new column total. Below is df_sale:
b_no a_id price c_id
120 24 50 2
120 56 100 2
120 90 25 2
120 45 20 2
231 89 55 3
231 45 20 3
231 10 250 3
Expected output after reshaping:
b_no a_id_1 a_id_2 a_id_3 a_id_4 total c_id
120 24 56 90 45 195 2
231 89 45 10 0 325 3
What I have tried so far is using sum() on df_sale['price'] separately for 120 and 231. I do not understand how I should reshape the data, add the new column headers, and get the total without being computationally inefficient. Thanks.
This might not be the cleanest method (at all), but it gets the outcome you want:
reshaped_df = (df.groupby('b_no')[['price', 'c_id']]
                 .first()
                 .join(df.groupby('b_no')['a_id']
                         .apply(list)
                         .apply(pd.Series)
                         .add_prefix('a_id_'))
                 .drop(columns='price')  # positional axis args were removed in pandas 2.0
                 .join(df.groupby('b_no')['price'].sum().to_frame('total'))
                 .fillna(0))
>>> reshaped_df
c_id a_id_0 a_id_1 a_id_2 a_id_3 total
b_no
120 2 24.0 56.0 90.0 45.0 195
231 3 89.0 45.0 10.0 0.0 325
You can achieve this by grouping on b_no and c_id, summing price into total, and flattening a_id:
import pandas as pd

d = {"b_no": [120, 120, 120, 120, 231, 231, 231],
     "a_id": [24, 56, 90, 45, 89, 45, 10],
     "price": [50, 100, 25, 20, 55, 20, 250],
     "c_id": [2, 2, 2, 2, 3, 3, 3]}
df = pd.DataFrame(data=d)

df2 = df.groupby(['b_no', 'c_id'])['a_id'].apply(list).apply(pd.Series).add_prefix('a_id_').fillna(0)
df2["total"] = df.groupby(['b_no', 'c_id'])['price'].sum()
print(df2)
print(df2)
a_id_0 a_id_1 a_id_2 a_id_3 total
b_no c_id
120 2 24.0 56.0 90.0 45.0 195
231 3 89.0 45.0 10.0 0.0 325
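Both answers number the columns from a_id_0; if you want the a_id_1 ... a_id_4 headers from the expected output, one possible variant (a sketch using cumcount and pivot) is:
df['col'] = df.groupby('b_no').cumcount() + 1   # 1-based position within each b_no
wide = (df.pivot(index='b_no', columns='col', values='a_id')
          .add_prefix('a_id_')
          .fillna(0)
          .astype(int))
wide['total'] = df.groupby('b_no')['price'].sum()
wide['c_id'] = df.groupby('b_no')['c_id'].first()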

Replacing values in a pandas multi-index

I have a dataframe with a multi-index. I want to change the value of the 2nd index when certain conditions on the first index are met.
I found a similar (but different) question here: Replace a value in MultiIndex (pandas)
which doesn't answer my question, because that was about changing a single row, and the solution passed the value of the first index (which didn't need changing) too. In my case I am dealing with multiple rows, and I haven't been able to adapt that solution to my case.
A minimal example of my data is below. Thanks!
import pandas as pd
import numpy as np
consdf = pd.DataFrame()
for mylocation in ['North', 'South']:
    for scenario in np.arange(1, 4):
        df = pd.DataFrame()
        df['mylocation'] = [mylocation]
        df['scenario'] = [scenario]
        df['this'] = np.random.randint(10, 100)
        df['that'] = df['this'] * 2
        df['something else'] = df['this'] * 3
        consdf = pd.concat((consdf, df), axis=0, ignore_index=True)

mypiv = consdf.pivot('mylocation', 'scenario').transpose()
level_list = ['this', 'that']

# if level 0 is in level_list --> set level 1 to np.nan
mypiv.iloc[mypiv.index.get_level_values(0).isin(level_list)].index.set_levels([np.nan], level=1, inplace=True)
The last line doesn't work: I get:
ValueError: On level 1, label max (2) >= length of level (1). NOTE: this index is in an inconsistent state
IIUC you could add a new value to the level values and then change the labels of your index, using advanced indexing with the get_level_values, set_levels and set_labels methods:
len_ind = len(mypiv.loc[(level_list,)].index.get_level_values(1))
mypiv.index.set_levels([1, 2, 3, np.nan], level=1, inplace=True)
mypiv.index.set_labels([3]*len_ind + mypiv.index.labels[1][len_ind:].tolist(), level=1, inplace=True)
In [219]: mypiv
Out[219]:
mylocation North South
scenario
this NaN 26 46
NaN 32 67
NaN 75 30
that NaN 52 92
NaN 64 134
NaN 150 60
something else 1.0 78 138
2.0 96 201
3.0 225 90
Note: your values for the other scenario rows will be converted to float, because a level must hold a single dtype and np.nan is a float.
Note: ix has been deprecated since pandas 0.20; use the loc accessor instead. Likewise, MultiIndex.labels and set_labels were renamed to codes and set_codes in pandas 0.24.
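For current pandas, where those APIs are gone and indexes can no longer be mutated in place, a rough equivalent of the accepted approach could rebuild the index explicitly (a sketch, not tested on every version):
idx = mypiv.index
mask = idx.get_level_values(0).isin(level_list)
# blank out level 1 wherever level 0 is in level_list
new_level1 = np.where(mask, np.nan, idx.get_level_values(1))
mypiv.index = pd.MultiIndex.from_arrays(
    [idx.get_level_values(0), new_level1], names=idx.names)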
Here is a solution using the reset_index() method:
In [95]: new = mypiv.reset_index()
In [96]: new
Out[96]:
mylocation level_0 scenario North South
0 this 1 32 64
1 this 2 18 40
2 this 3 76 56
3 that 1 64 128
4 that 2 36 80
5 that 3 152 112
6 something else 1 96 192
7 something else 2 54 120
8 something else 3 228 168
In [100]: new.ix[new.level_0.isin(level_list), 'scenario'] = np.nan
In [101]: new
Out[101]:
mylocation level_0 scenario North South
0 this NaN 32 64
1 this NaN 18 40
2 this NaN 76 56
3 that NaN 64 128
4 that NaN 36 80
5 that NaN 152 112
6 something else 1.0 96 192
7 something else 2.0 54 120
8 something else 3.0 228 168
In [103]: mypiv = new.set_index(['level_0', 'scenario'])
In [104]: mypiv
Out[104]:
mylocation North South
level_0 scenario
this NaN 32 64
NaN 18 40
NaN 76 56
that NaN 64 128
NaN 36 80
NaN 152 112
something else 1.0 96 192
2.0 54 120
3.0 228 168
But I suspect there is a more elegant solution.
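One more compact possibility (a sketch; Index.map on a MultiIndex receives the index tuples and returns a MultiIndex when the function returns tuples):
mypiv.index = mypiv.index.map(
    lambda t: (t[0], np.nan) if t[0] in level_list else t)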
