I want to reformat a dataframe by transposing some columns while keeping other columns fixed.
Original data:
ID subID values_A
-- ----- --------
A aaa 10
B baa 20
A abb 30
A acc 40
C caa 50
B bbb 60
Pivot once:
pivot_table(df, index=["ID", "subID"])
Output:
ID subID values_A
-- ----- --------
A aaa 10
abb 30
acc 40
B baa 20
bbb 60
C caa 50
What I want to do (fix the ['ID'] column and partially transpose):
ID subID_1 value_1 subID_2 value_2 subID_3 value_3
-- ------- ------- -------- ------- ------- -------
A aaa 10 abb 30 acc 40
B baa 20 bbb 60 NaN NaN
C caa 50 NaN NaN NaN NaN
I know the maximum number of subIDs under each ID.
I don't need any values calculated when pivoting and transposing the dataframe.
Please help.
Use cumcount to create a counter, build a MultiIndex with set_index, reshape with unstack, and sort the first level of the MultiIndex in the columns with sort_index. Finally, flatten the columns with a list comprehension and call reset_index:
g = df.groupby('ID').cumcount()
df = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
#python 3.6+
df.columns = [f'{a}_{b+1}' for a, b in df.columns]
#python below 3.6
#df.columns = ['{}_{}'.format(a, b+1) for a, b in df.columns]
df = df.reset_index()
print (df)
ID subID_1 values_A_1 subID_2 values_A_2 subID_3 values_A_3
0 A aaa 10.0 abb 30.0 acc 40.0
1 B baa 20.0 bbb 60.0 NaN NaN
2 C caa 50.0 NaN NaN NaN NaN
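For reference, the snippet above assumes the sample frame is constructed along these lines (a minimal setup sketch based on the question's data):
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'A', 'A', 'C', 'B'],
                   'subID': ['aaa', 'baa', 'abb', 'acc', 'caa', 'bbb'],
                   'values_A': [10, 20, 30, 40, 50, 60]})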
Let's say I have 2 dataframes;
both have different lengths but the same number of columns:
df1 = pd.DataFrame({'country': ['Russia','Mexico','USA','Argentina','Denmark','Syngapore'],
                    'population': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'country': ['Russia','Argentina','Australia','USA'],
                    'population': [44,12,23,64]})
Let's assume that some of the data in df1 is outdated and I've received a new dataframe that contains some new data which may or may not already exist in the outdated dataframe.
I want to find out if any of the values of df2.country are inside df1.country
By doing the following I'm able to return a boolean:
df = df1.country.isin(df2.country)
print(df)
Unfortunately, this just creates a new Series containing the answer to my question:
0 True
1 False
2 True
3 True
4 False
5 False
Name: country, dtype: bool
My goal here is to delete the rows of df1 whose values match df2 and add the new data, kind of like an update.
I've managed to come up with something like this:
df = df1.country.isin(df2.country)
i = 0
for x in df:
    if x:
        df1.drop(i, inplace=True)
    i += 1
frames = [df1, df2]
df1 = pd.concat(frames)
df1.reset_index(drop=True, inplace=True)
print(df1)
which in fact works and updates the dataframe
country population
0 Mexico 12
1 Denmark 123
2 Syngapore 24
3 Russia 44
4 Argentina 12
5 Australia 23
6 USA 64
But I really believe there's a better, quicker, and more practical way of doing the same thing, considering that the real dataframe is much bigger and updates every few seconds.
I'd love to hear some suggestions. Thanks!
Assuming col1 remains unique in the original dataframe, you can join the two tables together. Once you have them in the same dataframe, you can apply your logic, i.e. take the value from the new dataframe if it is not null. You actually don't need to check whether col2 has changed for every entry in col1; you can simply replace the col2 value with the one from the new dataframe as long as it is not NaN (based on your sample output).
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': ['a','f','r','g','d','s'], 'col2': [41,12,26,64,123,24]})
df2 = pd.DataFrame({'col1': ['a','g','o','r'], 'col2': [44,12,23,64]})
# do the join
x = pd.merge(df1, df2, how='outer',
             left_on='col1', right_on='col1')
col1 col2_x col2_y
0 a 41.0 44.0
1 f 12.0 NaN
2 r 26.0 64.0
3 g 64.0 12.0
4 d 123.0 NaN
5 s 24.0 NaN
6 o NaN 23.0
# apply your update rules
x['col2_x'] = np.where(
    ~x['col2_y'].isnull(),
    x['col2_y'], x['col2_x']
)
col1 col2_x col2_y
0 a 44.0 44.0
1 f 12.0 NaN
2 r 64.0 64.0
3 g 12.0 12.0
4 d 123.0 NaN
5 s 24.0 NaN
6 o 23.0 23.0
#clean up
x.drop("col2_y", axis=1, inplace = True)
x.columns = ["col1", "col2"]
col1 col2
0 a 44.0
1 f 12.0
2 r 64.0
3 g 12.0
4 d 123.0
5 s 24.0
6 o 23.0
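As a side note, the same update rule can be written without numpy as x['col2_x'] = x['col2_y'].fillna(x['col2_x']), since fillna with a Series fills the missing entries from that Series, aligning on the index; whether that reads better than np.where is a matter of taste.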
The isin approach is so close! Simply use the result from isin as a mask, then concat the rows from df1 that are not in (~) df2 with all of df2:
m = df1['country'].isin(df2['country'])
df3 = pd.concat((df1[~m], df2), ignore_index=True)
df3:
country population
0 Mexico 12
1 Denmark 123
2 Syngapore 24
3 Russia 44
4 Argentina 12
5 Australia 23
6 USA 64
I have two pandas data frames df1 and df2
**df1**
cat  id  frequency
A    23  2
A    43  8
B    23  56
C    30  4

**df2**
id  (other cols)  A    B    C
23  .............  nan  nan  nan
43  .............  nan  nan  nan
30  .............  nan  nan  nan
I am looking for a way to extract information from df1 into df2, resulting in the format below, where the values of columns A, B and C are the frequency values from df1:
**df2**
id (other cols) A B C
30 .......... 0 0 4
23 .......... 2 56 0
43 .......... 8 0 0
Use DataFrame.pivot with DataFrame.combine_first:
df11 = df1.pivot(index='id', columns='cat', values='frequency')
#if id is column
df = df2.set_index('id').combine_first(df11)
#if id is index
df = df2.combine_first(df11)
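Putting the pieces together, here is a minimal self-contained sketch of the idea. The frames below are rebuilt from the question's sample (with the extra "(other cols)" left out), and the trailing fillna(0) is an assumption, added only to mirror the zeros in the desired output:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'cat': ['A', 'A', 'B', 'C'],
                    'id': [23, 43, 23, 30],
                    'frequency': [2, 8, 56, 4]})
df2 = pd.DataFrame({'id': [23, 43, 30],
                    'A': [np.nan] * 3,
                    'B': [np.nan] * 3,
                    'C': [np.nan] * 3})

# wide table: one row per id, one column per cat
df11 = df1.pivot(index='id', columns='cat', values='frequency')

# fill df2's empty A/B/C columns from the pivoted frequencies
res = df2.set_index('id').combine_first(df11).fillna(0).reset_index()
print(res)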
EDITED 3/5/19:
I tried different ways to merge and/or join the data below but couldn't wrap my head around how to do it correctly.
Initially I have a data like this:
index unique_id group_name id name
0 100 ABC 20 aaa
1 100 ABC 21 bbb
2 100 DEF 22 ccc
3 100 DEF 23 ddd
4 100 DEF 24 eee
5 100 DEF 25 fff
6 101 ABC 30 ggg
7 101 ABC 31 hhh
8 101 ABC 32 iii
9 101 DEF 33 jjj
The goal is to reshape it by merging on unique_id so that the result looks like this:
index unique_id group_name_x id_x name_x group_name_y id_y name_y
0 100 ABC 20 aaa DEF 22 ccc
1 100 ABC 21 bbb DEF 23 ddd
2 100 NaN NaN NaN DEF 24 eee
3 100 NaN NaN NaN DEF 25 fff
4 101 ABC 30 ggg DEF 33 jjj
5 101 ABC 31 hhh NaN NaN NaN
6 101 ABC 32 iii NaN NaN NaN
How can I do this in pandas? The best I could think of is to split the data into two dataframes by group name (ABC and DEF) and then merge them with how='outer', on='unique_id', but that produces every combination of the records (2 ABC x 4 DEF = 8 records) without any NaN's.
pd.concat with axis=1 mentioned in answers doesn't align the data per unique_id and doesn't create any NaN's.
As you said, split the dataframe, then concat the two dataframes side by side (axis=1) after resetting both indexes.
A working code:
df=pd.read_clipboard()
req_cols=['group_name','id','name']
df_1=df[df['group_name']=='ABC'].reset_index(drop=True)
df_2=df[df['group_name']=='DEF'].reset_index(drop=True)
df_1=df_1.rename(columns = dict(zip(df_1[req_cols].columns.values, df_1[req_cols].add_suffix('_x'))))
df_2=df_2.rename(columns = dict(zip(df_2[req_cols].columns.values, df_2[req_cols].add_suffix('_y'))))
req_cols_x=[val+'_x'for val in req_cols]
print (pd.concat([df_2,df_1[req_cols_x]],axis=1))
O/P:
index unique_id group_name_y id_y name_y group_name_x id_x name_x
0 2 100 DEF 22 ccc ABC 20.0 aaa
1 3 100 DEF 23 ddd ABC 21.0 bbb
2 4 100 DEF 24 eee NaN NaN NaN
3 5 100 DEF 25 fff NaN NaN NaN
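If the two halves should also line up within each unique_id (as in the desired output), one way to extend the same split idea is to add a per-group counter with cumcount and join on it. A small sketch, using the question's sample data:
import pandas as pd

df = pd.DataFrame({
    'unique_id': [100, 100, 100, 100, 100, 100, 101, 101, 101, 101],
    'group_name': ['ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'DEF', 'ABC', 'ABC', 'ABC', 'DEF'],
    'id': [20, 21, 22, 23, 24, 25, 30, 31, 32, 33],
    'name': ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'ggg', 'hhh', 'iii', 'jjj']})

# row counter within each (unique_id, group_name) pair
df['ctr'] = df.groupby(['unique_id', 'group_name']).cumcount()

abc = df[df['group_name'] == 'ABC'].set_index(['unique_id', 'ctr'])
de = df[df['group_name'] == 'DEF'].set_index(['unique_id', 'ctr'])

# outer join on (unique_id, counter) keeps unmatched rows as NaN
out = abc.join(de, how='outer', lsuffix='_x', rsuffix='_y').reset_index()
print(out)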
I'd like to convert Df1 below to Df2.
The empty values would be filled with NaN.
The Dfs below are examples.
My data has weeks from 1 to 8.
There are 100,000 IDs.
Only week 8 has all IDs, so the total number of rows will be 100,000.
I have Df3, which has the 100,000 ids, and I want to merge Df1 onto Df3, formatted as Df2.
ex) pd.merge(df3, df1, on="id", how="left") -> but formatted as Df2
Df1>
wk, id, col1, col2 ...
1 1 0.5 15
2 2 0.5 15
3 3 0.5 15
1 2 0.5 15
3 2 0.5 15
------
Df2>
wk1, id, col1, col2, wk2, id, col1, col2, wk3, id, col1, col2,...
1 1 0.5 15 2 1 NaN NaN 3 1 NaN NaN
1 2 0.5 15 2 2 0.5 15 3 2 0.5 15
1 3 NaN NaN 2 3 NaN NaN 3 3 0.5 15
Use:
#create dictionary for rename columns for correct sorting
d = dict(enumerate(df.columns))
d1 = {v:k for k, v in d.items()}
#first add missing values for each `wk` and `id`
df1 = df.set_index(['wk', 'id']).unstack().stack(dropna=False).reset_index()
#for each id create a DataFrame, reshape by unstack and rename columns
df1 = (df1.groupby('id')
          .apply(lambda x: pd.DataFrame(x.values, columns=df.columns))
          .unstack()
          .reset_index(drop=True)
          .rename(columns=d1, level=0)
          .sort_index(axis=1, level=1)
          .rename(columns=d, level=0))
#convert values to integers if necessary
df1.loc[:, ['wk', 'id']] = df1.loc[:, ['wk', 'id']].astype(int)
#flatten MultiIndex in columns
df1.columns = ['{}_{}'.format(a, b) for a, b in df1.columns]
print (df1)
wk_0 id_0 col1_0 col2_0 wk_1 id_1 col1_1 col2_1 wk_2 id_2 col1_2 \
0 1 1 0.5 15.0 2 1 NaN NaN 3 1 NaN
1 1 2 0.5 15.0 2 2 0.5 15.0 3 2 0.5
2 1 3 NaN NaN 2 3 NaN NaN 3 3 0.5
col2_2
0 NaN
1 15.0
2 15.0
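For reference, the answers to this question assume df already holds Df1's sample rows, which can be rebuilt along these lines (a minimal setup sketch):
import pandas as pd

df = pd.DataFrame({'wk': [1, 2, 3, 1, 3],
                   'id': [1, 2, 3, 2, 2],
                   'col1': [0.5, 0.5, 0.5, 0.5, 0.5],
                   'col2': [15, 15, 15, 15, 15]})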
You can use GroupBy + concat. The idea is to create a list of dataframes with appropriately named columns and an appropriate index, then concatenate along axis=1:
d = {k: v.reset_index(drop=True) for k, v in df.groupby('wk')}

def formatter(df, key):
    return df.rename(columns={'wk': f'wk{key}'}).set_index('id')

L = [formatter(df, key) for key, df in d.items()]
res = pd.concat(L, axis=1).reset_index()
print(res)
id  wk1  col1  col2  wk2  col1  col2  wk3  col1  col2
0 1 1.0 0.5 15.0 NaN NaN NaN NaN NaN NaN
1 2 1.0 0.5 15.0 2.0 0.5 15.0 3.0 0.5 15.0
2 3 NaN NaN NaN NaN NaN NaN 3.0 0.5 15.0
Note NaN forces your series to become float. There's no "good" fix for this.
I made a game and got the players' data like this:
StartTime Id Rank Score
2018-04-24 08:46:35.684000 aaa 1 280
2018-04-24 23:54:47.742000 bbb 2 176
2018-04-25 15:28:36.050000 ccc 1 223
2018-04-25 00:13:00.120000 aaa 4 79
2018-04-26 04:59:36.464000 ddd 1 346
2018-04-26 06:01:17.728000 fff 2 157
2018-04-27 04:57:37.701000 ggg 4 78
but I want to group it by day, just like this:
Date 2018/4/24 2018/4/25 2018/4/26 2018/4/27
ID aaa ccc ddd ggg
bbb aaa fff NaN
how do I group by date with Pandas?
Use set_index and cumcount:
df.set_index([df['StartTime'].dt.floor('D'),
              df.groupby(df['StartTime'].dt.floor('D')).cumcount()])['Id'].unstack(0)
Output:
StartTime 2018-04-24 2018-04-25 2018-04-26 2018-04-27
0 aaa ccc ddd ggg
1 bbb aaa fff NaN
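Note that the .dt accessor requires StartTime to be a datetime column; if it was read in as strings, convert it first with df['StartTime'] = pd.to_datetime(df['StartTime']).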
You can use cumcount to align the index by group, then concat to concatenate the series.
# normalize to zero out time
df['StartTime'] = pd.to_datetime(df['StartTime']).dt.normalize()
# get unique days and make index count by group
cols = df['StartTime'].unique()
df.index = df.groupby('StartTime').cumcount()
# concatenate list comprehension of series
res = pd.concat([df.loc[df['StartTime'] == i, 'Id'] for i in cols], axis=1)
res.columns = cols
print(res)
2018-04-24 2018-04-25 2018-04-26 2018-04-27
0 aaa ccc ddd ggg
1 bbb aaa fff NaN
Performance
For smaller dataframes, use #ScottBoston's more succinct solution. For larger dataframes, concat seems to scale better than unstack:
def scott(df):
    df['StartTime'] = pd.to_datetime(df['StartTime'])
    return df.set_index([df['StartTime'].dt.floor('D'),
                         df.groupby(df['StartTime'].dt.floor('D')).cumcount()])['Id'].unstack(0)

def jpp(df):
    df['StartTime'] = pd.to_datetime(df['StartTime']).dt.normalize()
    cols = df['StartTime'].unique()
    df.index = df.groupby('StartTime').cumcount()
    res = pd.concat([df.loc[df['StartTime'] == i, 'Id'] for i in cols], axis=1)
    res.columns = cols
    return res
df2 = pd.concat([df]*100000)
%timeit scott(df2) # 1 loop, best of 3: 681 ms per loop
%timeit jpp(df2) # 1 loop, best of 3: 271 ms per loop
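For completeness, the timed runs above assume df holds the question's sample rows, which can be rebuilt like this (a small setup sketch; df2 is then the 100,000-fold concatenation shown above):
import pandas as pd

df = pd.DataFrame({'StartTime': ['2018-04-24 08:46:35.684000', '2018-04-24 23:54:47.742000',
                                 '2018-04-25 15:28:36.050000', '2018-04-25 00:13:00.120000',
                                 '2018-04-26 04:59:36.464000', '2018-04-26 06:01:17.728000',
                                 '2018-04-27 04:57:37.701000'],
                   'Id': ['aaa', 'bbb', 'ccc', 'aaa', 'ddd', 'fff', 'ggg'],
                   'Rank': [1, 2, 1, 4, 1, 2, 4],
                   'Score': [280, 176, 223, 79, 346, 157, 78]})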
import pandas as pd
df = pd.DataFrame({'StartTime': ['2018-04-01 15:25:11', '2018-04-04 16:25:11', '2018-04-04 15:27:11'], 'Score': [10, 20, 30]})
print(df)
This yields
Score StartTime
0 10 2018-04-01 15:25:11
1 20 2018-04-04 16:25:11
2 30 2018-04-04 15:27:11
Now we create a new column based on the StartTime column, which contains only the date:
df['Date'] = df['StartTime'].apply(lambda x: x.split(' ')[0])
print(df)
Output:
Score StartTime Date
0 10 2018-04-01 15:25:11 2018-04-01
1 20 2018-04-04 16:25:11 2018-04-04
2 30 2018-04-04 15:27:11 2018-04-04
We can now use the pd.DataFrame.groupby method to group the rows by the values of the new Date column. In the example below, I first group the rows and then iterate over the groups to print the name (the value of the Date column for that group) and the mean score achieved:
for name, group in df.groupby('Date'):
    print(name)
    print(group)
    print(group['Score'].mean())
Gives:
2018-04-01
Score StartTime Date
0 10 2018-04-01 15:25:11 2018-04-01
10.0
2018-04-04
Score StartTime Date
1 20 2018-04-04 16:25:11 2018-04-04
2 30 2018-04-04 15:27:11 2018-04-04
25.0
Edit: Since you initially did not provide the dataframe data in table format, I leave it as an exercise to you to adapt the dataframe in my answer ;-)