Python Pandas: transpose a date range of values

I have this pandas DataFrame:
year_month Type ID Values1 Values2 Values3 ...
2022-01 A 1 1 0 0
2022-02 A 1 3 4 6
2022-03 A 1 5 9 10
2022-01 B 2 5 9 10
2022-02 B 2 4 2 1
.... ... ... ...
I want to transpose my results like this. How can I do this with Python?
ID Type Values 2022-01 2022-02 2022-03 ...
1 A Values1 1 3 5
1 A Values2 0 4 9
1 A Values3 0 6 10
2 B Values1 5 4 0
2 B Values2 9 2 0
2 B Values3 10 1 0
...

Try using pivot:
new = df.pivot(index=['ID', 'Type'], columns='year_month').stack(level=0).reset_index()
year_month ID Type level_2 2022-01 2022-02 2022-03
0 1 A Values1 1.0 3.0 5.0
1 1 A Values2 0.0 4.0 9.0
2 1 A Values3 0.0 6.0 10.0
3 2 B Values1 5.0 4.0 NaN
4 2 B Values2 9.0 2.0 NaN
5 2 B Values3 10.0 1.0 NaN
You can remove the columns name (year_month) if you want by doing:
new.columns.name = ''
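To match the desired header exactly, you can also rename the auto-generated level_2 column (a small extra step, not shown in the original answer):
new = new.rename(columns={'level_2': 'Values'})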

Concatenating pandas dataframes on the basis of increasing index number and retaining their position

I have a pandas dataframe as follows:
Class Sex SibSp Fare
0 0 0 0 0
2 2 2 2 2
3 3 3 3 3
5 5 5 5 5
I have another pandas dataframe as follows:
Class Sex SibSp Fare
1 1 1 1 1
4 4 4 4 4
If I concatenate these two dataframes using
pd.concat([traindf,testdf])
I get the following result:
Class Sex SibSp Fare
0 0 0 0 0
2 2 2 2 2
3 3 3 3 3
5 5 5 5 5
1 1 1 1 1
4 4 4 4 4
However, I want to get result as follows:
Class Sex SibSp Fare
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
I have used pd.concat([traindf,testdf]).sort_values(), but this does not work (DataFrame.sort_values requires a by argument). Any idea how to accomplish this so that the dataframes are concatenated based on their index numbers? Thanks.
If you need to sort by index, use:
df = pd.concat([traindf,testdf]).sort_index()
Or if you need to sort by the column Class, use:
df = pd.concat([traindf,testdf]).sort_values(by=['Class'])
If you want to copy all the columns then you can get the slice using loc and just overwrite it.
# Create some dummy dataframes
import copy
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        'Pclass': np.random.randint(0, 10, 10),
        'Fare': np.random.randint(0, 10, 10),
        'Age': np.random.randint(0, 100, 10)
    })
df2 = copy.deepcopy(df1[df1['Fare'] % 2 == 0] * 1.5)
print(df1, df2)
# Overwrite the matching rows of df1 with df2
for i in df2.index:
    if i in df1.index:
        df1.loc[i] = df2.loc[i]
print("After overwrite")
print(df1)
Output:
Pclass Fare Age
0 7 6 25
1 8 3 34
2 0 4 57
3 9 1 98
4 3 5 58
5 8 0 97
6 9 6 53
7 2 0 1
8 0 5 33
9 2 9 36
Pclass Fare Age
0 10.5 9.0 37.5
2 0.0 6.0 85.5
5 12.0 0.0 145.5
6 13.5 9.0 79.5
7 3.0 0.0 1.5
After overwrite
Pclass Fare Age
0 10.5 9.0 37.5
1 8.0 3.0 34.0
2 0.0 6.0 85.5
3 9.0 1.0 98.0
4 3.0 5.0 58.0
5 12.0 0.0 145.5
6 13.5 9.0 79.5
7 3.0 0.0 1.5
8 0.0 5.0 33.0
9 2.0 9.0 36.0
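As a side note (not part of the original answer), pandas can do this row-wise overwrite in a single call that aligns on the index:
# equivalent to the loop above: overwrite df1 with df2's non-NaN values
df1.update(df2)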
You could possibly fill in the ages without splitting the dataframes. But if you have to split them, then you can use the following:
pd.concat([traindf, testdf], sort=False).sort_index()

I am trying to replace NaN values with mean values

I have to replace the s_months and incidents NaN values with the corresponding means in a Jupyter notebook.
Input data:
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
I have tried the code below, but it does not seem to work; I have also tried different variations, such as replacing the transform.
df.fillna['s_months'] = df.fillna(df.grouby(['types' , 'o_periods']['s_months','incidents']).tranform('mean'),inplace = True)
s_months incidents
Types o_periods
1 1 911 3
2 1688 8
2 1 26851 36
2 14440 36
3 1 914 2
2 862 1
4 1 296 0
2 889 3
5 1 663 4
2 1046 6
From your DataFrame:
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
Types,c_years,o_periods,s_months,incidents
0,1,1,1,127.0,0.0
1,1,1,2,63.0,0.0
2,1,2,1,1095.0,3.0
3,1,2,2,1095.0,4.0
4,1,3,1,1512.0,6.0
5,1,3,2,3353.0,18.0
6,1,4,1,NaN,NaN
7,1,4,2,2244.0,11.0
14,2,4,1,NaN,NaN"""), sep=',')
>>> df
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
>>> df[['c_years', 's_months', 'incidents']] = df.groupby(['Types', 'o_periods']).transform(lambda x: x.fillna(x.mean()))
>>> df
Types c_years o_periods s_months incidents
0 1 1 1 127.000000 0.0
1 1 1 2 63.000000 0.0
2 1 2 1 1095.000000 3.0
3 1 2 2 1095.000000 4.0
4 1 3 1 1512.000000 6.0
5 1 3 2 3353.000000 18.0
6 1 4 1 911.333333 3.0
7 1 4 2 2244.000000 11.0
14 2 4 1 NaN NaN
The last NaN is here because it belongs to the last group, which contains no values in the columns s_months and incidents and therefore has no mean.
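If you also want to fill that remaining group, one option (a sketch, not part of the original answer) is a second pass with the overall column means:
# fall back to the overall column means for groups with no observations
df[['s_months', 'incidents']] = df[['s_months', 'incidents']].fillna(df[['s_months', 'incidents']].mean())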
Try this: df['s_months'] = df['s_months'].fillna(df['s_months'].mean())
df['s_months'].mean() computes the mean ignoring NaNs. Note this uses the overall column mean, not the per-group mean.
Your code is close; you can modify it as follows to make it work:
df[['s_months','incidents']] = df[['s_months','incidents']].fillna(df.groupby(['Types' , 'o_periods'])[['s_months','incidents']].transform('mean'))
Input data:
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
Output
Types c_years o_periods s_months incidents
0 1 1 1 127.000000 0.0
1 1 1 2 63.000000 0.0
2 1 2 1 1095.000000 3.0
3 1 2 2 1095.000000 4.0
4 1 3 1 1512.000000 6.0
5 1 3 2 3353.000000 18.0
6 1 4 1 911.333333 3.0
7 1 4 2 2244.000000 11.0
14 2 4 1 NaN NaN

How to vectorize a non-overlapped dataframe into an overlapped, shifting dataframe?

I would like to transform a regular dataframe into a multi-index dataframe with overlap and shift.
For example, the input dataframe is like this sample code:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.arange(0, 12).reshape(-1, 2), columns=['d1', 'd2'], dtype=float)
df.index.name = 'idx'
print(df)
Output:
d1 d2
idx
0 0.0 1.0
1 2.0 3.0
2 4.0 5.0
3 6.0 7.0
4 8.0 9.0
5 10.0 11.0
What I want to output is: make it overlap by batch, shifting one row at a time (add a column batchid to label every shift), like this (batchsize=4):
d1 d2
idx batchid
0 0 0.0 1.0
1 0 2.0 3.0
2 0 4.0 5.0
3 0 6.0 7.0
1 1 2.0 3.0
2 1 4.0 5.0
3 1 6.0 7.0
4 1 8.0 9.0
2 2 4.0 5.0
3 2 6.0 7.0
4 2 8.0 9.0
5 2 10.0 11.0
My work so far:
I can make it work by iterating and concatenating the pieces together, but it takes a lot of time.
batchsize = 4
ds, ids = [], []
idx = df.index.values
for bi in range(int(len(df) - batchsize + 1)):
    ids.append(idx[bi:bi+batchsize])
for k, idx in enumerate(ids):
    di = df.loc[pd.IndexSlice[idx], :].copy()
    di['batchid'] = k
    ds.append(di)
res = pd.concat(ds).fillna(0)
res.set_index('batchid', inplace=True, append=True)
Is there a way to vectorize and accelerate this process?
Thanks.
First, we create a 'mask' that tells us which elements go into which batch id:
nrows = len(df)
batchsize = 4
mask_columns = {i:np.pad([1]*batchsize,(i,nrows-batchsize-i)) for i in range(nrows-batchsize+1)}
mask_df = pd.DataFrame(mask_columns)
df = df.join(mask_df)
This adds a few columns to df:
idx d1 d2 0 1 2
----- ---- ---- --- --- ---
0 0 1 1 0 0
1 2 3 1 1 0
2 4 5 1 1 1
3 6 7 1 1 1
4 8 9 0 1 1
5 10 11 0 0 1
This now looks like a df with 'dummies', and we need to 'reverse' the dummies:
df2 = df.set_index(['d1','d2'], drop=True)
df2[df2==1].stack().reset_index().drop(columns=0).sort_values('level_2').rename(columns={'level_2':'batchid'})
produces
d1 d2 batchid
-- ---- ---- ---------
0 0 1 0
1 2 3 0
3 4 5 0
6 6 7 0
2 2 3 1
4 4 5 1
7 6 7 1
9 8 9 1
5 4 5 2
8 6 7 2
10 8 9 2
11 10 11 2
You can accomplish this with a list comprehension inside pd.concat, using iloc with a variable i that iterates through a range. This should be quicker:
batchsize = 4
df = (pd.concat([df.iloc[i:batchsize+i].assign(batchid=i)
                 for i in range(df.shape[0] - batchsize + 1)])
        .set_index(['batchid'], append=True))
df
Out[1]:
d1 d2
idx batchid
0 0 0.0 1.0
1 0 2.0 3.0
2 0 4.0 5.0
3 0 6.0 7.0
1 1 2.0 3.0
2 1 4.0 5.0
3 1 6.0 7.0
4 1 8.0 9.0
2 2 4.0 5.0
3 2 6.0 7.0
4 2 8.0 9.0
5 2 10.0 11.0
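For a fully vectorized alternative (a sketch assuming NumPy >= 1.20 is available; not from the original answers), sliding_window_view can build all the windows at once and the MultiIndex can be rebuilt from the window offsets:
import numpy as np
import pandas as pd

batchsize = 4
# shape (n_batches, n_cols, batchsize): one window per batch id
windows = np.lib.stride_tricks.sliding_window_view(df.to_numpy(), batchsize, axis=0)
n_batches = windows.shape[0]
# reorder to (n_batches * batchsize, n_cols): rows stacked batch by batch
values = windows.transpose(0, 2, 1).reshape(-1, df.shape[1])
# rebuild the (idx, batchid) MultiIndex from the window offsets
batchid = np.repeat(np.arange(n_batches), batchsize)
idx = np.tile(np.arange(batchsize), n_batches) + batchid
res = pd.DataFrame(values, columns=df.columns,
                   index=pd.MultiIndex.from_arrays([idx, batchid], names=['idx', 'batchid']))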

Python pandas : groupby on two columns and create new variables

I have the following dataframe describing the percent of shares held by a type of investor in a company:
company investor pct
1 A 1
1 A 2
1 B 4
2 A 2
2 A 4
2 A 6
2 C 10
2 C 8
And I would like to create a new column for each investor type computing the mean of the shares held in each company. I also need to keep the same length of the dataset, using transform for instance.
Here is the result I would like to have:
company investor pct pct_mean_A pct_mean_B pct_mean_C
1 A 1 1.5 4 0
1 A 2 1.5 4 0
1 B 4 1.5 4 0
2 A 2 4.0 0 9
2 A 4 4.0 0 9
2 A 6 4.0 0 9
2 C 10 4.0 0 9
2 C 8 4.0 0 9
Thanks a lot for your help!
Use groupby with mean aggregation and reshape by unstack into a helper DataFrame, which is then joined to the original df:
s = (df.groupby(['company','investor'])['pct']
       .mean()
       .unstack(fill_value=0)
       .add_prefix('pct_mean_'))
df = df.join(s, 'company')
print (df)
company investor pct pct_mean_A pct_mean_B pct_mean_C
0 1 A 1 1.5 4.0 0.0
1 1 A 2 1.5 4.0 0.0
2 1 B 4 1.5 4.0 0.0
3 2 A 2 4.0 0.0 9.0
4 2 A 4 4.0 0.0 9.0
5 2 A 6 4.0 0.0 9.0
6 2 C 10 4.0 0.0 9.0
7 2 C 8 4.0 0.0 9.0
Or use pivot_table with its default aggregation function, mean:
s = df.pivot_table(index='company',
                   columns='investor',
                   values='pct',
                   fill_value=0).add_prefix('pct_mean_')
df = df.join(s, 'company')
print (df)
company investor pct pct_mean_A pct_mean_B pct_mean_C
0 1 A 1 1.5 4 0
1 1 A 2 1.5 4 0
2 1 B 4 1.5 4 0
3 2 A 2 4.0 0 9
4 2 A 4 4.0 0 9
5 2 A 6 4.0 0 9
6 2 C 10 4.0 0 9
7 2 C 8 4.0 0 9
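Since the question mentions transform, here is a sketch (my own variation, not from the answer) that stays closer to that idea by building one column per investor type:
# hypothetical helper loop: one transform per investor type
for inv in df['investor'].unique():
    df[f'pct_mean_{inv}'] = (df['pct'].where(df['investor'] == inv)
                             .groupby(df['company']).transform('mean')
                             .fillna(0))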

Fast way to get the number of NaNs in a column counted from the last valid value in a DataFrame

Say I have a DataFrame like
A B
0 0.1880 0.345
1 0.2510 0.585
2 NaN NaN
3 NaN NaN
4 NaN 1.150
5 0.2300 1.210
6 0.1670 1.290
7 0.0835 1.400
8 0.0418 NaN
9 0.0209 NaN
10 NaN NaN
11 NaN NaN
12 NaN NaN
I want a new DataFrame of the same shape where each entry represents the number of NaNs counted from the last valid value up to its position, as follows:
A B
0 0 0
1 0 0
2 1 1
3 2 2
4 3 0
5 0 0
6 0 0
7 0 0
8 0 1
9 0 2
10 1 3
11 2 4
12 3 5
I wonder if this can be done efficiently by utilizing some of the Pandas/NumPy functions?
You can use:
a = df.isnull()
b = a.cumsum()
df1 = b.sub(b.mask(a).ffill().fillna(0).astype(int))
print (df1)
A B
0 0 0
1 0 0
2 1 1
3 2 2
4 3 0
5 0 0
6 0 0
7 0 0
8 0 1
9 0 2
10 1 3
11 2 4
12 3 5
For better understanding: b counts NaNs cumulatively; masking b where the original values are NaN and forward-filling carries that count forward from the last valid position, so subtracting it from b gives the number of NaNs since then.
# insert NaN where a is True
a2 = b.mask(a)
# forward-fill the NaNs
a3 = b.mask(a).ffill()
# replace NaN with 0, cast to int
a4 = b.mask(a).ffill().fillna(0).astype(int)
# subtract a4 from b
a5 = b.sub(b.mask(a).ffill().fillna(0).astype(int))
df1 = pd.concat([a, b, a2, a3, a4, a5], axis=1,
                keys=['a', 'b', 'where', 'ffill nan', 'subtract', 'output'])
print (df1)
a b where ffill nan subtract output
A B A B A B A B A B A B
0 False False 0 0 0.0 0.0 0.0 0.0 0 0 0 0
1 False False 0 0 0.0 0.0 0.0 0.0 0 0 0 0
2 True True 1 1 NaN NaN 0.0 0.0 0 0 1 1
3 True True 2 2 NaN NaN 0.0 0.0 0 0 2 2
4 True False 3 2 NaN 2.0 0.0 2.0 0 2 3 0
5 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
6 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
7 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
8 False True 3 3 3.0 NaN 3.0 2.0 3 2 0 1
9 False True 3 4 3.0 NaN 3.0 2.0 3 2 0 2
10 True True 4 5 NaN NaN 3.0 2.0 3 2 1 3
11 True True 5 6 NaN NaN 3.0 2.0 3 2 2 4
12 True True 6 7 NaN NaN 3.0 2.0 3 2 3 5
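An equivalent per-column formulation (my own sketch, not from the answer) groups each column by its running count of valid values and cumulatively counts the NaNs within each group:
out = df.apply(lambda s: s.isna().groupby(s.notna().cumsum()).cumsum())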
