I'd like to convert Df1 below to Df2.
The missing values should be filled with NaN.
The Dfs below are only examples.
My real data has weeks 1 to 8 and 100,000 IDs.
Only week 8 contains every ID, so the result will have 100,000 rows.
I also have Df3, which holds all 100,000 ids, and I want to merge Df1 onto Df3 so the result is formatted like Df2.
ex) pd.merge(df3, df1, on="id", how="left") -> but formatted as Df2
Df1>
wk, id, col1, col2 ...
1 1 0.5 15
2 2 0.5 15
3 3 0.5 15
1 2 0.5 15
3 2 0.5 15
------
Df2>
wk1, id, col1, col2, wk2, id, col1, col2, wk3, id, col1, col2,...
1 1 0.5 15 2 1 NaN NaN 3 1 NaN NaN
1 2 0.5 15 2 2 0.5 15 3 2 0.5 15
1 3 NaN NaN 2 3 NaN NaN 3 3 0.5 15
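For reference, here is a minimal construction of the example Df1 (values taken from the table above), so the snippets below can be run as-is:
import pandas as pd

# example Df1 from the question
df = pd.DataFrame({
    'wk':   [1, 2, 3, 1, 3],
    'id':   [1, 2, 3, 2, 2],
    'col1': [0.5, 0.5, 0.5, 0.5, 0.5],
    'col2': [15, 15, 15, 15, 15],
})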
Use:
#create dictionary for rename columns for correct sorting
d = dict(enumerate(df.columns))
d1 = {v:k for k, v in d.items()}
#first add missing values for each `wk` and `id`
df1 = df.set_index(['wk', 'id']).unstack().stack(dropna=False).reset_index()
#for each id create a DataFrame, reshape by unstack and rename columns
df1 = (df1.groupby('id')
          .apply(lambda x: pd.DataFrame(x.values, columns=df.columns))
          .unstack()
          .reset_index(drop=True)
          .rename(columns=d1, level=0)
          .sort_index(axis=1, level=1)
          .rename(columns=d, level=0))
#convert values to integers if necessary
df1.loc[:, ['wk', 'id']] = df1.loc[:, ['wk', 'id']].astype(int)
#flatten MultiIndex in columns
df1.columns = ['{}_{}'.format(a, b) for a, b in df1.columns]
print (df1)
wk_0 id_0 col1_0 col2_0 wk_1 id_1 col1_1 col2_1 wk_2 id_2 col1_2 \
0 1 1 0.5 15.0 2 1 NaN NaN 3 1 NaN
1 1 2 0.5 15.0 2 2 0.5 15.0 3 2 0.5
2 1 3 NaN NaN 2 3 NaN NaN 3 3 0.5
col2_2
0 NaN
1 15.0
2 15.0
You can use GroupBy + concat. The idea is to create a list of dataframes with appropriately named columns and an appropriate index, then concatenate along axis=1:
d = {k: v.reset_index(drop=True) for k, v in df.groupby('wk')}
def formatter(df, key):
    return df.rename(columns={'wk': f'wk{key}'}).set_index('id')
L = [formatter(df, key) for key, df in d.items()]
res = pd.concat(L, axis=1).reset_index()
print(res)
id wk1 col1 col2 wk2 col1 col2 wk3 col1 col2
0 1 1.0 0.5 15.0 NaN NaN NaN NaN NaN NaN
1 2 1.0 0.5 15.0 2.0 0.5 15.0 3.0 0.5 15.0
2 3 NaN NaN NaN NaN NaN NaN 3.0 0.5 15.0
Note NaN forces your series to become float. There's no "good" fix for this.
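If you also need the reshaped result left-joined onto Df3 (the frame holding all 100,000 ids), a minimal sketch, assuming Df3 has a single id column:
# same L as above; reset_index turns id back into a column
wide = pd.concat(L, axis=1).reset_index()
res = df3.merge(wide, on='id', how='left')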
I have a dataframe as follows with multiple rows per id (maximum 3).
dat = pd.DataFrame({'id':[1,1,1,2,2,3,4,4], 'code': ["A","B","D","B","D","A","A","D"], 'amount':[11,2,5,22,5,32,11,5]})
id code amount
0 1 A 11
1 1 B 2
2 1 D 5
3 2 B 22
4 2 D 5
5 3 A 32
6 4 A 11
7 4 D 5
I want to consolidate the df and have only one row per id so that it looks as follows:
id code1 amount1 code2 amount2 code3 amount3
0 1 A 11 B 2 D 5
1 2 B 22 D 5 NaN NaN
2 3 A 32 NaN NaN NaN NaN
3 4 A 11 D 5 NaN NaN
How can I achieve this in pandas?
Use GroupBy.cumcount to build a per-id counter, reshape with DataFrame.unstack and DataFrame.sort_index, then flatten the MultiIndex columns and turn id back into a column with DataFrame.reset_index:
df = (dat.set_index(['id', dat.groupby('id').cumcount().add(1)])
         .unstack()
         .sort_index(axis=1, level=1, sort_remaining=False))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print (df)
id code1 amount1 code2 amount2 code3 amount3
0 1 A 11.0 B 2.0 D 5.0
1 2 B 22.0 D 5.0 NaN NaN
2 3 A 32.0 NaN NaN NaN NaN
3 4 A 11.0 D 5.0 NaN NaN
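An equivalent formulation uses DataFrame.pivot on an explicit counter column; this is only a sketch of the same idea, and the helper column name n is mine:
out = (dat.assign(n=dat.groupby('id').cumcount().add(1))
          .pivot(index='id', columns='n')
          .sort_index(axis=1, level=1, sort_remaining=False))
out.columns = out.columns.map(lambda x: f'{x[0]}{x[1]}')
out = out.reset_index()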
I have two tables that look like the following:
Table T1
ColumnA  ColumnB
A        1
A        3
B        1
C        2
Table T2
ColumnA  ColumnB
A        1
A        4
B        1
D        2
In SQL I would use the following query to check the existence of each record:
select
COALESCE(T1.ColumnA,T2.ColumnA) as ColumnA
,T1.ColumnB as ExistT1
,T2.ColumnB as ExistT2
from T1
full join T2 on
T1.ColumnA=T2.ColumnA
and T1.ColumnB=T2.ColumnB
where
(T1.ColumnA is null or T2.ColumnA is null)
I have tried many ways in Pandas, such as concat, join, and merge, but it seems the two merge keys always get combined into one column.
I think the problem is that what I want to check are not 'data columns' but 'key columns'.
Is there a good way to do this in Python? Thanks!
The expected result:
ColumnA  ExistT1  ExistT2
A        3        null
A        null     4
C        2        null
D        null     2
pd.merge has an indicator parameter that could be helpful here:
(t1
 .merge(t2, how='outer', indicator=True)
 .loc[lambda df: df._merge != "both"]
 .assign(ExistT1=lambda df: df.ColumnB.where(df._merge.eq('left_only')),
         ExistT2=lambda df: df.ColumnB.where(df._merge.eq('right_only')))
 .drop(columns=['ColumnB', '_merge'])
)
ColumnA ExistT1 ExistT2
1 A 3.0 NaN
3 C 2.0 NaN
4 A NaN 4.0
5 D NaN 2.0
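The snippet above assumes t1 and t2 already exist as DataFrames; a minimal construction from the tables in the question, plus a look at the intermediate indicator frame:
import pandas as pd

t1 = pd.DataFrame({'ColumnA': ['A', 'A', 'B', 'C'], 'ColumnB': [1, 3, 1, 2]})
t2 = pd.DataFrame({'ColumnA': ['A', 'A', 'B', 'D'], 'ColumnB': [1, 4, 1, 2]})

# indicator=True adds a _merge column marking each row as
# 'left_only', 'right_only' or 'both'
print(t1.merge(t2, how='outer', indicator=True))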
First, merge the two dataframes with the following code:
(df1.assign(ExistT1=df1['ColumnB'])
    .merge(df2.assign(ExistT2=df2['ColumnB']), how='outer'))
output:
ColumnA ColumnB ExistT1 ExistT2
0 A 1 1.00 1.00
1 A 3 3.00 NaN
2 B 1 1.00 1.00
3 C 2 2.00 NaN
4 A 4 NaN 4.00
5 D 2 NaN 2.00
Second, drop ColumnB and the rows where both tables match (like row 0 and row 2). The full code so far:
(df1.assign(ExistT1=df1['ColumnB'])
    .merge(df2.assign(ExistT2=df2['ColumnB']), how='outer')
    .drop('ColumnB', axis=1)
    .loc[lambda x: x.isnull().any(axis=1)])
output:
ColumnA ExistT1 ExistT2
1 A 3.00 NaN
3 C 2.00 NaN
4 A NaN 4.00
5 D NaN 2.00
Finally, sort_values and reset_index (full code):
(df1.assign(ExistT1=df1['ColumnB'])
    .merge(df2.assign(ExistT2=df2['ColumnB']), how='outer')
    .drop('ColumnB', axis=1)
    .loc[lambda x: x.isnull().any(axis=1)]
    .sort_values(['ColumnA']).reset_index(drop=True))
result:
ColumnA ExistT1 ExistT2
0 A 3.00 NaN
1 A NaN 4.00
2 C 2.00 NaN
3 D NaN 2.00
I know the code for filling each column separately, as below:
data['Native Country'].fillna(data['Native Country'].mode(), inplace=True)
But I am working on a dataset with 50 rows, and there are 20 categorical columns that need to be imputed.
Is there a single line of code for imputing the entire dataset?
Use DataFrame.fillna with DataFrame.mode and select the first row, because if several values tie for the highest number of occurrences, all of them are returned:
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'A': list('abcdef'),
    'col1': [4, 5, 4, 5, 5, 4],
    'col2': [np.nan, 8, 3, 3, 2, 3],
    'col3': [3, 3, 5, 5, np.nan, np.nan],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
cols = ['col1','col2','col3']
print (data[cols].mode())
col1 col2 col3
0 4 3.0 3.0
1 5 NaN 5.0
data[cols] = data[cols].fillna(data[cols].mode().iloc[0])
print (data)
A col1 col2 col3 E F
0 a 4 3.0 3.0 5 a
1 b 5 8.0 3.0 3 a
2 c 4 3.0 5.0 6 a
3 d 5 3.0 5.0 9 b
4 e 5 2.0 3.0 2 b
5 f 4 3.0 3.0 4 b
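To address the "single line for the whole dataset" part: if every column should be imputed with its own mode (numeric and categorical alike), a minimal variant of the same idea is:
data = data.fillna(data.mode().iloc[0])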
I have a pandas DataFrame that looks similar to the following...
>>> df = pd.DataFrame({
... 'col1':['A','C','B','A','B','C','A'],
... 'col2':[np.nan,1.,np.nan,1.,1.,np.nan,np.nan],
... 'col3':[0,1,9,4,2,3,5],
... })
>>> df
col1 col2 col3
0 A NaN 0
1 C 1.0 1
2 B NaN 9
3 A 1.0 4
4 B 1.0 2
5 C NaN 3
6 A NaN 5
What I would like to do is group the rows by the value of col1 and then fill any NaN values in col2 with values that increment by 1, starting from the highest existing value of col2 within that group.
So that my expected results would look like the following...
>>> df
col1 col2 col3
0 A 1.0 4
1 A 2.0 0
2 A 3.0 5
3 B 1.0 2
4 B 2.0 9
5 C 1.0 1
6 C 2.0 3
I believe I can use something like groupby on col1, though I'm unsure how to increment the values in col2 based on the highest value in each group. I've tried the following, but instead of incrementing col2 it sets every value to 1.0 and adds an additional column...
>>> df1 = df.groupby(['col1'], as_index=False).agg({'col2': 'min'})
>>> df = pd.merge(df1, df, how='left', left_on=['col1'], right_on=['col1'])
>>> df
col1 col2_x col2_y col3
0 A 1.0 NaN 0
1 A 1.0 1.0 1
2 A 1.0 NaN 5
3 B 1.0 NaN 9
4 B 1.0 1.0 4
5 C 1.0 1.0 2
6 C 1.0 NaN 3
Use GroupBy.cumcount only on the rows with missing values, add the per-group maximum computed with GroupBy.transform('max'), and finally fill the remaining positions with the original values using fillna:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'C', 'B', 'A', 'B', 'B', 'B'],
    'col2': [np.nan, 1., np.nan, 1., 3., np.nan, 0],
    'col3': [0, 1, 9, 4, 2, 3, 4],
})
print (df)
col1 col2 col3
0 A NaN 0
1 C 1.0 1
2 B NaN 9
3 A 1.0 4
4 B 3.0 2
5 B NaN 3
6 B 0.0 4
df = df.sort_values(['col1','col2'], na_position='last')
s = df.groupby('col1')['col2'].transform('max')
df['new'] = (df[df['col2'].isna()]
             .groupby('col1')
             .cumcount()
             .add(1)
             .add(s)
             .fillna(df['col2'])
             .astype(int))
print (df)
col1 col2 col3 new
3 A 1.0 4 1
0 A NaN 0 2
6 B 0.0 4 0
4 B 3.0 2 3
2 B NaN 9 4
5 B NaN 3 5
1 C 1.0 1 1
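If you need the frame back in its original row order afterwards (the original integer index is preserved by the sort above), a small follow-up:
df = df.sort_index()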
Another way:
df['col2_new'] = df.groupby('col1')['col2'].apply(lambda x: x.replace(np.nan, x.value_counts().index[0]+1))
df = df.sort_values('col1')
Suppose I have a DataFrame:
df = pd.DataFrame({'CATEGORY':['a','b','c','b','b','a','b'],
'VALUE':[pd.np.NaN,1,0,0,5,0,4]})
which looks like
CATEGORY VALUE
0 a NaN
1 b 1
2 c 0
3 b 0
4 b 5
5 a 0
6 b 4
I group it:
df = df.groupby(by='CATEGORY')
And now, let me show what I want with the help of an example on one group, 'b':
df.get_group('b')
group b:
CATEGORY VALUE
1 b 1
3 b 0
4 b 5
6 b 4
What I need: within each group, compute diff() between the VALUE entries, skipping all NaNs and 0s. So the result should be:
CATEGORY VALUE DIFF
1 b 1 -
3 b 0 -
4 b 5 4
6 b 4 -1
You can use diff to subtract values after dropping 0 and NaN values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'CATEGORY': ['a','b','c','b','b','a','b'],
                   'VALUE': [np.nan, 1, 0, 0, 5, 0, 4]})
grouped = df.groupby("CATEGORY")
# define diff func
diff = lambda x: x["VALUE"].replace(0, np.NaN).dropna().diff()
df["DIFF"] = grouped.apply(diff).reset_index(0, drop=True)
print(df)
CATEGORY VALUE DIFF
0 a NaN NaN
1 b 1.0 NaN
2 c 0.0 NaN
3 b 0.0 NaN
4 b 5.0 4.0
5 a 0.0 NaN
6 b 4.0 -1.0
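To see why this produces the expected values for group 'b': after replace(0, np.NaN).dropna() the lambda only keeps the rows at index 1, 4 and 6, so diff() compares consecutive surviving values and the result is aligned back to those positions:
b = df[df['CATEGORY'] == 'b']
print(diff(b))
# 1    NaN
# 4    4.0
# 6   -1.0
# Name: VALUE, dtype: float64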
Sounds like a job for a pd.Series.shift() operation along with a notnull mask.
First we remove the unwanted values before grouping the data (note the .copy(), since we add columns to this subset below):
nonull_df = df[(df['VALUE'] != 0) & df['VALUE'].notnull()].copy()
groups = nonull_df.groupby(by='CATEGORY')
Now we can shift within each group and calculate the diff:
nonull_df['next_value'] = groups['VALUE'].shift(1)
nonull_df['diff'] = nonull_df['VALUE'] - nonull_df['next_value']
Lastly, and optionally, you can join the new columns back onto the original dataframe:
df = df.join(nonull_df[['next_value', 'diff']])
df
CATEGORY VALUE next_value diff
0 a NaN NaN NaN
1 b 1.0 NaN NaN
2 c 0.0 NaN NaN
3 b 0.0 NaN NaN
4 b 5.0 1.0 4.0
5 a 0.0 NaN NaN
6 b 4.0 5.0 -1.0
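As a side note, the shift-and-subtract can be collapsed into a single GroupBy.diff call on the filtered frame (a sketch reusing the same nonull_df as above); assignment aligns on the index, so excluded rows end up as NaN:
df['DIFF'] = nonull_df.groupby('CATEGORY')['VALUE'].diff()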