I have a dataframe in the following format:
Index Stock Open
1 ABC 10
2 ABC 12
: : :
1 PQR 12
2 PQR 23
: : :
1 XYZ 0.5
2 XYZ 0.9
: : :
I would like to transform the above dataframe using the variable Stock as the classification variable; the required format is shown below:
Index ABC PQR XYZ
1 10 12 0.5
2 12 23 0.9
: : : :
Note: there might be multiple variables...
Please tell me how I can transform or transpose the dataframe into the above format.
I think you are searching for a pivot table:
df.pivot(index='Index', columns='Stock', values=['Open', 'Close'])
Open Close
Stock ABC PQR XYZ ABC PQR XYZ
Index
1 10.0 12.0 0.5 13.0 15.0 0.13
2 12.0 23.0 0.9 14.0 16.0 0.14
I used a test dataframe constructed like this:
import pandas as pd
from io import StringIO

s = '''Index Stock Open Close
1 ABC 10 13
2 ABC 12 14
1 PQR 12 15
2 PQR 23 16
1 XYZ 0.5 .13
2 XYZ 0.9 .14'''
df = pd.read_table(StringIO(s), sep=r'\s+', engine='python')
Index Stock Open Close
0 1 ABC 10.0 13.00
1 2 ABC 12.0 14.00
2 1 PQR 12.0 15.00
3 2 PQR 23.0 16.00
4 1 XYZ 0.5 0.13
5 2 XYZ 0.9 0.14
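Worth noting: pivot raises a ValueError if any (Index, Stock) pair occurs more than once. If your data can contain such duplicates, pivot_table aggregates them instead (mean by default); a minimal sketch on the same test dataframe:
# pivot_table tolerates duplicate (Index, Stock) pairs by aggregating them
out = df.pivot_table(index='Index', columns='Stock', values=['Open', 'Close'], aggfunc='mean')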
Related
Hi, I have 2 dataframes as below:
key  Apple  Banana
abc      1      12
bcd     23      21

key  Train  Car
abc     11   20
jkn      2   19
I want to merge these 2 dataframes on my key column so that I get the following table:
key  Train  Car  Banana  Apple
abc    11    20     12      1
jkn     2    19   0/NA   0/NA
bcd   0/NA  0/NA    21     23
For cells where I don't have any record, like jkn / Apple, either 0 or NA should be printed.
I tried pd.concat, but I can't quite figure out how to get my desired result.
Use pd.merge() with how='outer'; read further in the documentation:
import pandas as pd
import io
data_string = """key Apple Banana
abc 1 12
bcd 23 21
"""
df1 = pd.read_csv(io.StringIO(data_string), sep=r'\s+')
data_string = """key Train Car
abc 11 20
jkn 2 19"""
df2 = pd.read_csv(io.StringIO(data_string), sep=r'\s+')
# Solution
df_result = pd.merge(df1, df2, on=['key'], how='outer')
print(df_result)
key Apple Banana Train Car
0 abc 1.0 12.0 11.0 20.0
1 bcd 23.0 21.0 NaN NaN
2 jkn NaN NaN 2.0 19.0
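The question allows either 0 or NA for the missing cells; if you prefer literal zeros, fill them in afterwards:
df_result = df_result.fillna(0)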
Let's try concat followed by groupby.sum:
out = (pd.concat([df1, df2], ignore_index=True)
.groupby('key', as_index=False).sum())
print(out)
key Apple Banana Train Car
0 abc 1.0 12.0 11.0 20.0
1 bcd 23.0 21.0 0.0 0.0
2 jkn 0.0 0.0 2.0 19.0
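If the exact column order from the desired output matters, a final reindex (with the order taken from the question) arranges the columns:
out = out.reindex(columns=['key', 'Train', 'Car', 'Banana', 'Apple'])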
Here is the sample data:
import pandas as pd
df=pd.DataFrame({'P_Name':['ABC','ABC','ABC','ABC','PQR','PQR','PQR','PQR','XYZ','XYZ','XYZ','XYZ'],
'Date':['11/01/2020','12/01/2020','13/01/2020','14/01/2020','11/01/2020','12/01/2020','13/01/2020','14/01/2020','11/01/2020','12/01/2020','13/01/2020','14/01/2020'],
'Open':['242.584','238.179','233.727','229.441','241.375','28.965','235.96','233.193','280.032','78.472','277.592','276.71'],
'End':['4.405','4.452','4.286','4.405','2.41','3.005','2.767','3.057','1.56','0.88','0.882','0.88'],
'Close':['238.179','233.727','229.441','225.036','238.965','235.96','233.193','230.136','278.472','277.592','276.71','275.83']})
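Note that every numeric column in this sample holds strings, not numbers. The equality checks below still work because identical strings compare equal, but converting first is an option (a small optional step, not part of the original question):
# optional: parse the string columns into floats
df[['Open', 'End', 'Close']] = df[['Open', 'End', 'Close']].astype(float)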
I'm trying to create a new column, Check. The first row of every new product should get 1. Every other row should get 1 if the previous row's Close equals the current row's Open (e.g. df['Close'][0] == df['Open'][1]), and 0 if they differ (e.g. df['Close'][8] != df['Open'][9]).
df after applying these conditions:
P_Name Date Open End Close Check
0 ABC 11/01/2020 242.584 4.405 238.179 1
1 ABC 12/01/2020 238.179 4.452 233.727 1
2 ABC 13/01/2020 233.727 4.286 229.441 1
3 ABC 14/01/2020 229.441 4.405 225.036 1
4 PQR 11/01/2020 241.375 2.41 238.965 1
5 PQR 12/01/2020 28.965 3.005 235.96 0
6 PQR 13/01/2020 235.96 2.767 233.193 1
7 PQR 14/01/2020 233.193 3.057 230.136 1
8 XYZ 11/01/2020 280.032 1.56 278.472 1
9 XYZ 12/01/2020 78.472 0.88 277.592 0
10 XYZ 13/01/2020 277.592 0.882 276.71 1
11 XYZ 14/01/2020 276.71 0.88 275.83 1
You can compare values shifted per group with DataFrameGroupBy.shift and Series.eq, replace the missing first value of each group with the Open column via Series.fillna, and cast the boolean mask to 0/1 with Series.astype:
df['Check'] = df.Open.eq(df.groupby('P_Name').Close.shift().fillna(df.Open)).astype(int)
Another idea is to compare without groups, but chain another mask with Series.duplicated to match the first row of each group:
df['Check'] = (~df.P_Name.duplicated() | df.Open.eq(df.Close.shift())).astype(int)
print(df)
P_Name Date Open End Close Check
0 ABC 11/01/2020 242.584 4.405 238.179 1
1 ABC 12/01/2020 238.179 4.452 233.727 1
2 ABC 13/01/2020 233.727 4.286 229.441 1
3 ABC 14/01/2020 229.441 4.405 225.036 1
4 PQR 11/01/2020 241.375 2.41 238.965 1
5 PQR 12/01/2020 28.965 3.005 235.96 0
6 PQR 13/01/2020 235.96 2.767 233.193 1
7 PQR 14/01/2020 233.193 3.057 230.136 1
8 XYZ 11/01/2020 280.032 1.56 278.472 1
9 XYZ 12/01/2020 78.472 0.88 277.592 0
10 XYZ 13/01/2020 277.592 0.882 276.71 1
11 XYZ 14/01/2020 276.71 0.88 275.83 1
check = [1]  # the first row is always the start of a new product
for i in range(len(df) - 1):
    # a new product always gets 1; otherwise compare previous Close to current Open
    if df['P_Name'][i + 1] != df['P_Name'][i] or df['Close'][i] == df['Open'][i + 1]:
        check.append(1)
    else:
        check.append(0)
df['Check'] = check
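This loop produces the same Check column as the vectorized answers above, but a row-by-row Python loop is much slower on large frames; prefer the shift/eq approach where possible.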
Sorry for the seemingly confusing title. The problem should be really simple, but I'm stumped and need some help here.
The data frame that I have now:
New_ID STATE MEAN
0 1 Lagos 7166.101571
1 2 Rivers 2464.065846
2 3 Oyo 1974.699365
3 4 Akwa 1839.126698
4 5 Kano 1757.642462
I want to create a new column where row i holds df.loc[:i, 'MEAN'].sum() / df['MEAN'].sum().
For example, for data frame:
ID MEAN
0 1.0 5
1 2.0 10
2 3.0 15
3 4.0 30
4 5.0 40
My desired output:
ID MEAN SUBTOTAL
0 1.0 5 0.05
1 2.0 10 0.15
2 3.0 15 0.30
3 4.0 30 0.60
4 5.0 40 1.00
I tried
df1['SUbTotal'] = df1.loc[:df1['New_ID'], 'MEAN']/df1['MEAN'].sum()
but it says:
Name: New_ID, dtype: int32' is an invalid key
Thanks for your time in advance
This should do it; it seems like you're looking for cumsum:
df['SUBTOTAL'] = df.MEAN.cumsum() / df.MEAN.sum()
>>> df
ID MEAN SUBTOTAL
0 1.0 5 0.05
1 2.0 10 0.15
2 3.0 15 0.30
3 4.0 30 0.60
4 5.0 40 1.00
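For reference, the original attempt failed because df1.loc[:df1['New_ID'], 'MEAN'] uses an entire Series as the slice endpoint; .loc expects a scalar label there, which is why pandas reports the Series as an invalid key.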
I have this:
df['new'] = df[['col1', 'col2']].pct_change(axis=1)
I want the percent change across rows in col1 and col2. However I am getting the error:
ValueError: Wrong number of items passed 2, placement implies 1
What am I doing wrong?
The percent change function returns a pandas DataFrame with two columns! This is why you see the ValueError: one item is expected, but two were passed.
import numpy as np
import pandas as pd

x = np.arange(1, 11)
y = x * 3
df = pd.DataFrame()
df['col1'] = x
df['col2'] = y
df
col1 col2
0 1 3
1 2 6
2 3 9
3 4 12
4 5 15
5 6 18
6 7 21
7 8 24
8 9 27
9 10 30
df.pct_change(axis=1)
col1 col2
0 NaN 2.0
1 NaN 2.0
2 NaN 2.0
3 NaN 2.0
4 NaN 2.0
5 NaN 2.0
6 NaN 2.0
7 NaN 2.0
8 NaN 2.0
9 NaN 2.0
The percent change across rows that you want is stored in the last column ('col2' in this case), so just choose that last column to populate the 'new' column. In this case we compute a 200% change for every row.
df['new'] = df.pct_change(axis=1)['col2']
col1 col2 new
0 1 3 2.0
1 2 6 2.0
2 3 9 2.0
3 4 12 2.0
4 5 15 2.0
5 6 18 2.0
6 7 21 2.0
7 8 24 2.0
8 9 27 2.0
9 10 30 2.0
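If there are only ever these two columns, the same numbers can be computed directly without pct_change (an equivalent sketch):
# (col2 - col1) / col1, i.e. the row-wise percent change
df['new'] = df['col2'].div(df['col1']).sub(1)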
I have a dataframe like this:
name tag time val
0 ABC A 1 10
0 ABC A 1 12
1 ABC B 1 12
1 ABC B 1 14
2 ABC A 2 11
3 ABC C 2 12
4 DEF B 3 10
5 DEF C 3 9
6 GHI A 4 14
7 GHI B 4 12
8 GHI C 5 10
Each row is an observation at a timestamp and shows the value for the name/tag pair in that row.
What I want is a dataframe where each row shows the mean value from each tag at each timestamp, like this:
name time A B C
0 ABC 1 11.0 13.0 NaN
1 ABC 2 11.0 NaN 12.0
2 DEF 3 NaN 10.0 9.0
3 GHI 4 14.0 12.0 NaN
4 GHI 5 NaN NaN 10.0
I can achieve this successfully by grouping by name and time and returning a transposed series each time:
tags = df['tag'].unique()  # the tags that become columns

def transpose_df(observation_df):
    ser = pd.Series(dtype=float)
    for tag in tags:
        ser[tag] = observation_df[observation_df['tag'] == tag]['val'].mean()
    return ser

tdf = df.groupby(['name', 'time']).apply(transpose_df).reset_index()
But this is slow. I feel like there must be a smarter way using a built-in transpose/reshape tool, but I can't figure it out. Can anyone suggest a better alternative?
In [175]: df.pivot_table(index=['name','time'], columns='tag', values='val').reset_index()
Out[175]:
tag name time A B C
0 ABC 1 11.0 13.0 NaN
1 ABC 2 11.0 NaN 12.0
2 DEF 3 NaN 10.0 9.0
3 GHI 4 14.0 12.0 NaN
4 GHI 5 NaN NaN 10.0
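Note that pivot_table, unlike a plain pivot, copes with the duplicate (name, time, tag) rows in the sample (ABC has two A/1 and two B/1 observations): it aggregates them, using mean as the default aggfunc.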
Option 1
Use pivot_table:
df.pivot_table(values='val',index=['name','time'],columns='tag',aggfunc='mean').reset_index()
Output:
tag name time A B C
0 ABC 1 11.0 13.0 NaN
1 ABC 2 11.0 NaN 12.0
2 DEF 3 NaN 10.0 9.0
3 GHI 4 14.0 12.0 NaN
4 GHI 5 NaN NaN 10.0
Option 2
Use groupby and unstack:
df.groupby(['name','time','tag']).agg('mean')['val'].unstack().reset_index()
Output:
tag name time A B C
0 ABC 1 11.0 13.0 NaN
1 ABC 2 11.0 NaN 12.0
2 DEF 3 NaN 10.0 9.0
3 GHI 4 14.0 12.0 NaN
4 GHI 5 NaN NaN 10.0
Option 3
Use set_index, a group-wise mean over the index levels, and unstack:
df.set_index(['name','time','tag']).groupby(level=[0,1,2]).mean()['val'].unstack().reset_index()
Output:
tag name time A B C
0 ABC 1 11.0 13.0 NaN
1 ABC 2 11.0 NaN 12.0
2 DEF 3 NaN 10.0 9.0
3 GHI 4 14.0 12.0 NaN
4 GHI 5 NaN NaN 10.0
You can also groupby and then unstack (equivalent to a pivot table).
>>> df.groupby(['name', 'time', 'tag'])['val'].mean().unstack('tag').reset_index()
tag name time A B C
0 ABC 1 11.0 13.0 NaN
1 ABC 2 11.0 NaN 12.0
2 DEF 3 NaN 10.0 9.0
3 GHI 4 14.0 12.0 NaN
4 GHI 5 NaN NaN 10.0
By the way, transform is for when you want to maintain the shape of your original dataframe, e.g.
>>> df.assign(tag_mean=df.groupby(['name', 'time', 'tag'])['val'].transform('mean'))
name tag time val tag_mean
0 ABC A 1 10 11.0
0 ABC A 1 12 11.0
1 ABC B 1 12 13.0
1 ABC B 1 14 13.0
2 ABC A 2 11 11.0
3 ABC C 2 12 12.0
4 DEF B 3 10 10.0
5 DEF C 3 9 9.0
6 GHI A 4 14 14.0
7 GHI B 4 12 12.0
8 GHI C 5 10 10.0