Transpose/Pivot DataFrame but not all columns in the same row - python

I have a DataFrame and I want to transpose it.
import pandas as pd

df = pd.DataFrame({'ID': [111, 111, 222, 222, 333, 333],
                   'Month': ['Jan', 'Feb', 'Jan', 'Feb', 'Jan', 'Feb'],
                   'Employees': [2, 3, 1, 5, 7, 1],
                   'Subsidy': [20, 30, 10, 15, 40, 5]})
print(df)
    ID Month  Employees  Subsidy
0  111   Jan          2       20
1  111   Feb          3       30
2  222   Jan          1       10
3  222   Feb          5       15
4  333   Jan          7       40
5  333   Feb          1        5
Desired output:
    ID        Var  Jan  Feb
0  111  Employees    2    3
1  111    Subsidy   20   30
0  222  Employees    1    5
1  222    Subsidy   10   15
0  333  Employees    7    1
1  333    Subsidy   40    5
My attempt: I tried pivot_table(), but Employees and Subsidy naturally end up in the same row, whereas I want them on separate rows.
df.pivot_table(index=['ID'], columns='Month', values=['Employees', 'Subsidy'])

      Employees      Subsidy
Month       Feb  Jan     Feb  Jan
ID
111           3    2      30   20
222           5    1      15   10
333           1    7       5   40
I tried transpose(), but it transposes the entire DataFrame; there seems to be no way to transpose while keeping one column fixed. Any suggestions?

You can add DataFrame.rename_axis to set a new name (Var) for the first column level after pivoting, and None to drop the Month axis name from the final DataFrame. The frame is then reshaped by DataFrame.stack on the first level, and finally the MultiIndex is converted to columns by DataFrame.reset_index:
df2 = (df.pivot_table(index='ID',
                      columns='Month',
                      values=['Employees', 'Subsidy'])
         .rename_axis(['Var', None], axis=1)
         .stack(level=0)
         .reset_index())
print(df2)

    ID        Var  Feb  Jan
0  111  Employees    3    2
1  111    Subsidy   30   20
2  222  Employees    5    1
3  222    Subsidy   15   10
4  333  Employees    1    7
5  333    Subsidy    5   40

You were on point with your pivot_table approach. The only thing you missed is stack and reset_index:
df.pivot_table(index=['ID'], columns='Month', values=['Employees', 'Subsidy']).stack(0).reset_index()

Out[42]:
Month   ID    level_1  Feb  Jan
0      111  Employees    3    2
1      111    Subsidy   30   20
2      222  Employees    5    1
3      222    Subsidy   15   10
4      333  Employees    1    7
5      333    Subsidy    5   40
You can rename the column to Var later if needed.
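For instance, the rename can be chained right on (a sketch on the question's data; the rename/rename_axis combination is one option among several):

```python
import pandas as pd

df = pd.DataFrame({'ID': [111, 111, 222, 222, 333, 333],
                   'Month': ['Jan', 'Feb', 'Jan', 'Feb', 'Jan', 'Feb'],
                   'Employees': [2, 3, 1, 5, 7, 1],
                   'Subsidy': [20, 30, 10, 15, 40, 5]})

out = (df.pivot_table(index=['ID'], columns='Month',
                      values=['Employees', 'Subsidy'])
         .stack(0)
         .reset_index()
         .rename(columns={'level_1': 'Var'})  # name the stacked level
         .rename_axis(columns=None))          # drop the leftover 'Month' axis name

print(out)
```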

Related

Grouping of a dataframe monthly after calculating the highest daily values

I've got a dataframe with two columns: one is a datetime column of dates, and the other holds a quantity. It looks something like this:
          Date  Quantity
0   2019-01-05        10
1   2019-01-10        15
2   2019-01-22        14
3   2019-02-03        12
4   2019-05-11        25
5   2019-05-21         4
6   2019-07-08         1
7   2019-07-30        15
8   2019-09-05        31
9   2019-09-10        44
10  2019-09-25         8
11  2019-12-09        10
12  2020-04-11       111
13  2020-04-17         5
14  2020-06-05        17
15  2020-06-16        12
16  2020-06-22        14
I want to make another dataframe with two columns, Month/Year and Till Highest: the highest quantity value seen up to and including that month, grouped by month/year. Precisely, I want:
  Month/Year  Till Highest
0   Jan/2019            15
1   Feb/2019            15
2   May/2019            25
3   Jul/2019            25
4   Sep/2019            44
5   Dec/2019            44
6   Apr/2020           111
7   Jun/2020           111
In my case the dataset is vast, and I have readings for almost every day of each month and year in the specified timeline. Here I've made a dummy dataset to show an example of what I want.
Please help me with this. Thanks in advance :)
See the annotated code:
(df
 # convert date to monthly period (2019-01)
 .assign(Date=pd.to_datetime(df['Date']).dt.to_period('M'))
 # period and max quantity per month
 .groupby('Date')
 .agg(**{'Month/Year': ('Date', 'first'),
         'Till highest': ('Quantity', 'max')})
 # format periods as Jan/2019 and take the cumulative max quantity
 .assign(**{'Month/Year': lambda d: d['Month/Year'].dt.strftime('%b/%Y'),
            'Till highest': lambda d: d['Till highest'].cummax()})
 # drop the groupby index
 .reset_index(drop=True)
)
output:

  Month/Year  Till highest
0   Jan/2019            15
1   Feb/2019            15
2   May/2019            25
3   Jul/2019            25
4   Sep/2019            44
5   Dec/2019            44
6   Apr/2020           111
7   Jun/2020           111
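The same running maximum can be written a little more compactly with a monthly period grouper plus cummax (a sketch on a subset of the sample data; column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2019-01-05', '2019-01-10', '2019-02-03',
                            '2019-05-11', '2020-04-11', '2020-06-05'],
                   'Quantity': [10, 15, 12, 25, 111, 17]})

per = pd.to_datetime(df['Date']).dt.to_period('M')  # monthly period key
out = (df.groupby(per)['Quantity'].max()  # monthly maximum
         .cummax()                        # running maximum so far
         .reset_index(name='Till Highest'))
out['Month/Year'] = out['Date'].dt.strftime('%b/%Y')
out = out[['Month/Year', 'Till Highest']]
print(out)
```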
In R you can use cummax:

df <- data.frame(
  Date = c("2019-01-05", "2019-01-10", "2019-01-22", "2019-02-03",
           "2019-05-11", "2019-05-21", "2019-07-08", "2019-07-30",
           "2019-09-05", "2019-09-10", "2019-09-25", "2019-12-09",
           "2020-04-11", "2020-04-17", "2020-06-05", "2020-06-16",
           "2020-06-22"),
  Quantity = c(10, 15, 14, 12, 25, 4, 1, 15, 31, 44, 8, 10, 111, 5, 17, 12, 14))

data.frame(`Month/Year` = unique(format(as.Date(df$Date), "%b/%Y")),
           `Till Highest` = cummax(tapply(df$Quantity, sub("-..$", "", df$Date), max)),
           check.names = FALSE, row.names = NULL)
  Month/Year Till Highest
1   Jan/2019           15
2   Feb/2019           15
3   May/2019           25
4   Jul/2019           25
5   Sep/2019           44
6   Dec/2019           44
7   Apr/2020          111
8   Jun/2020          111

Creating a Box-Plot but by value_counts() [Number of events occurred]

I have the following dataframe. Each row is an event that occurred (550,624 events in total). Suppose we are interested in a box plot of the number of events occurring per day, for each month.
print(df)
        Month  Day
0           4    1
1           4    1
2           4    1
3           4    1
4           4    1
...       ...  ...
550619     10   31
550620     10   31
550621     10   31
550622     10   31
550623     10   31

[550624 rows x 2 columns]
df2 = df.groupby('Month')['Day'].value_counts().sort_index()

Month  Day
4      1      2162
       2      1564
       3      1973
       4      1620
       5      1860
...
10     27     2022
       28     1606
       29     1316
       30     1674
       31     1726
sns.boxplot(x=df2.index.get_level_values('Month'), y=df2)
Output of sns.boxplot
My question is whether this is the most efficient/direct way to produce this visual, or whether I am taking a roundabout route.
Is there a more direct way to achieve this visual?
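One slightly more direct sketch (assuming the same Month/Day columns) counts events per (Month, Day) with groupby(...).size() and feeds the flat result straight to seaborn:

```python
import pandas as pd
# import seaborn as sns  # plotting call left commented out below

# toy stand-in for the 550,624-event dataframe
df = pd.DataFrame({'Month': [4, 4, 4, 10, 10, 10, 10],
                   'Day':   [1, 1, 2, 31, 31, 31, 30]})

# one row per (Month, Day) with the event count -- equivalent to
# groupby('Month')['Day'].value_counts(), but already a flat frame
counts = df.groupby(['Month', 'Day']).size().reset_index(name='Events')
print(counts)

# sns.boxplot(data=counts, x='Month', y='Events')
```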

Pandas Collapse and Stack Multi-level columns

I want to break down the multi-level columns and move them into column values.
Original data input (excel):
As read in dataframe:
  Company Name Company code 2017-01-01 00:00:00 Unnamed: 3 Unnamed: 4 Unnamed: 5  \
0          NaN          NaN           Product A  Product B  Product C  Product D
1    Company A         #123                   1          5          3          5
2    Company B         #124                 600        208         30         20
3    Company C         #125                 520        112         47         15
4    Company D         #126                 420        165        120         31

  2017-02-01 00:00:00 Unnamed: 7 Unnamed: 8 Unnamed: 9 2017-03-01 00:00:00  \
0           Product A  Product B  Product C  Product D           Product A
1                   0          2          3          4                   0
2                 600        213         30         15                 600
3                 520        110         47         10                 520
4                 420        195        120         30                 420

  Unnamed: 11 Unnamed: 12 Unnamed: 13
0   Product B   Product C   Product D
1           1           2           3
2         232          30          12
3         111          47          15
4         182         120          58
Intended data frame:
I have tried stack() and unstack() and also swaplevel(), but I couldn't get the date columns to drop down as rows. The merged cells in Excel produce NaN in the dataframe, and when the merged cells are in the columns I get Unnamed columns. How do I work around this? Am I missing something really simple here?
Using stack:

df.stack(level=0).reset_index(level=1)
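A runnable sketch of that idea on a small stand-in frame (the real data would come from read_excel with header=[0, 1], which is an assumption here, as are the toy values):

```python
import pandas as pd

# stand-in for the Excel sheet: two header rows -> MultiIndex columns
cols = pd.MultiIndex.from_product([['2017-01-01', '2017-02-01'],
                                   ['Product A', 'Product B']])
df = pd.DataFrame([[1, 5, 0, 2],
                   [600, 208, 600, 213]],
                  index=['Company A', 'Company B'],
                  columns=cols)

# move the date level (level 0) down into the rows,
# then turn it into an ordinary column
out = df.stack(level=0).reset_index(level=1)
print(out)
```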

How to perform conditional updation of column values in Pandas DataFrame?

I have the dataframe below. Is there any way to perform conditional addition of column values in pandas?
emp_id emp_name    City  months_worked  default_sal  total_sal  jan  feb  mar  apr  may  jun
   111      aaa    pune              2           90        NaN    4    5    5   54    3    2
   222      bbb    pune              1           70        NaN    5    4    4    8    3    4
   333      ccc  mumbai              2          NaN        NaN    9    3    4    8    4    3
   444      ddd     hyd              4          NaN        NaN    3    8    6    4    2    7
What I want to achieve:

- If City == 'pune', default_sal should be copied into total_sal; e.g. for emp_id 111, total_sal should be 90.
- If City != 'pune', total_sal should depend on months_worked; e.g. for emp_id 333, months_worked = 2, so total_sal is the sum of jan and feb: 9 + 3 = 12.
Desired O/P
emp_id emp_name    City  months_worked  default_sal  total_sal  jan  feb  mar  apr  may  jun
   111      aaa    pune              2           90         90    4    5    5   54    3    2
   222      bbb    pune              1           70         70    5    4    4    8    3    4
   333      ccc  mumbai              2          NaN         12    9    3    4    8    4    3
   444      ddd     hyd              4          NaN         21    3    8    6    4    2    7
Using np.where after creating a helper series:

s1 = pd.Series([df.iloc[x, 6:y + 6].sum() for x, y in enumerate(df.months_worked)],
               index=df.index)
np.where(df.City == 'pune', df.default_sal, s1)
Out[429]: array([ 90.,  70.,  12.,  21.])

# df['total_sal'] = np.where(df.City == 'pune', df.default_sal, s1)
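Put together on the sample data, a runnable sketch of that helper series plus np.where (the month columns are assumed to start at position 6, matching the question's layout):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'emp_id': [111, 222, 333, 444],
                   'emp_name': ['aaa', 'bbb', 'ccc', 'ddd'],
                   'City': ['pune', 'pune', 'mumbai', 'hyd'],
                   'months_worked': [2, 1, 2, 4],
                   'default_sal': [90, 70, np.nan, np.nan],
                   'total_sal': [np.nan] * 4,
                   'jan': [4, 5, 9, 3], 'feb': [5, 4, 3, 8],
                   'mar': [5, 4, 4, 6], 'apr': [54, 8, 8, 4],
                   'may': [3, 3, 4, 2], 'jun': [2, 4, 3, 7]})

# sum the first `months_worked` month columns (positions 6 onward) per row
s1 = pd.Series([df.iloc[x, 6:y + 6].sum() for x, y in enumerate(df.months_worked)],
               index=df.index)
df['total_sal'] = np.where(df.City == 'pune', df.default_sal, s1)
print(df[['emp_id', 'total_sal']])
```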

calculating mean and sum in pivot_table in pandas sorted by two separate desired col values

I have a dataset from 2015-2018 which has Month and Day as the second and third columns, like below:
Year  Month  Day  rain  temp  humidity  snow
2015      1    1     0    20        60     0
2015      1    2     2    18        58     0
2015      1    3     0    20        62     2
2015      1    4     5    15        62     0
2015      1    5     2    18        61     1
2015      1    6     0    19        60     2
2015      1    7     3    20        59     0
2015      1    8     2    17        65     0
2015      1    9     1    17        61     0
I wanted to use pivot_table to calculate something like the mean temperature for year 2016 and months (1, 2, 3).
I was wondering if anyone could help me with this?
You can do it with pd.cut, then groupby:

df.temp.groupby([df.Year,
                 pd.cut(df.Month, [0, 3, 6, 9, 12],
                        labels=['Winter', 'Spring', 'Summer', 'Autumn'],
                        right=False)]).mean()

Out[93]:
Year  Month
2015  Winter    18.222222
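If the goal is literally "mean temp for a given year and a few months", a plain pivot_table also works; a sketch on the question's sample data (which only contains 2015, month 1, so the selection is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Year': [2015] * 9, 'Month': [1] * 9,
                   'Day': range(1, 10),
                   'temp': [20, 18, 20, 15, 18, 19, 20, 17, 17]})

# mean temp per (Year, Month); pick a year and a set of months afterwards,
# e.g. pt.loc[2016, [1, 2, 3]].mean() on the full dataset
pt = df.pivot_table(index='Year', columns='Month', values='temp', aggfunc='mean')
print(pt.loc[2015, [1]].mean())
```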
