Pandas: how to unstack by group of columns while keeping columns paired - python

I need to unstack a contact list (id, relatives, phone numbers...) so that the columns keep a specific order.
Given an index, DataFrame unstack operates by unstacking single columns one by one, even when applied to a pair of columns.
Data I have:
df_have=pd.DataFrame.from_dict({'ID': {0: '100', 1: '100', 2: '100', 3: '100', 4: '100', 5: '200', 6: '200', 7: '200', 8: '200', 9: '200'},
'ID_RELATIVE': {0: '100', 1: '100', 2: '150', 3: '150', 4: '190', 5: '200', 6: '200', 7: '250', 8: '290', 9: '290'},
'RELATIVE_ROLE': {0: 'self', 1: 'self', 2: 'father', 3: 'father', 4: 'mother', 5: 'self', 6: 'self', 7: 'father', 8: 'mother', 9: 'mother'},
'PHONE': {0: '111111', 1: '222222', 2: '333333', 3: '444444', 4: '555555', 5: '123456', 6: '456789', 7: '987654', 8: '778899', 9: '909090'}})
Data I want:
df_want=pd.DataFrame.from_dict({'ID': {0: '100', 1: '200'},
'ID_RELATIVE_1': {0: '100', 1: '200'},
'RELATIVE_ROLE_1': {0: 'self', 1: 'self'},
'PHONE_1_1': {0: '111111', 1: '123456'},
'PHONE_1_2': {0: '222222', 1: '456789'},
'ID_RELATIVE_2': {0: '150', 1: '250'},
'RELATIVE_ROLE_2': {0: 'father', 1: 'father'},
'PHONE_2_1': {0: '333333', 1: '987654'},
'PHONE_2_2': {0: '444444', 1: 'nan'},
'ID_RELATIVE_3': {0: '190', 1: '290'},
'RELATIVE_ROLE_3': {0: 'mother', 1: 'mother'},
'PHONE_3_1': {0: '555555', 1: '778899'},
'PHONE_3_2': {0: 'nan', 1: '909090'}})
So, in the end, I need ID to be the index, and to unstack the other columns that will hence become attributes of ID.
The usual unstack process provides a "correct" output but in the wrong shape.
df2 = df_have.groupby(['ID'])[['ID_RELATIVE', 'RELATIVE_ROLE', 'PHONE']].apply(lambda x: x.reset_index(drop=True)).unstack()
This would require the re-ordering of columns and some removal of duplicates (by columns, not by row), together with a FOR loop. I'd like to avoid using this approach, since I'm looking for a more "elegant" way of achieving the desired result by means of grouping/stacking/unstacking/pivoting and so on.
Thanks a lot

The solution has 2 main steps: first group by all columns except PHONE to pair up the phone numbers, convert the column names to ordered categoricals for correct sorting, and then group by ID:
c = ['ID', 'ID_RELATIVE', 'RELATIVE_ROLE']
# number the phone entries within each relative (1, 2, ...) and unstack them to columns
df = df_have.set_index(c + [df_have.groupby(c).cumcount().add(1)])['PHONE']
df = df.unstack().add_prefix('PHONE_').reset_index()
# number the relatives within each ID (1, 2, ...)
df = df.set_index(['ID', df.groupby('ID').cumcount().add(1)])
# ordered categorical column names keep the original column order after unstacking
df.columns = pd.CategoricalIndex(df.columns, categories=df.columns.tolist(), ordered=True)
df = df.unstack().sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print(df)
ID ID_RELATIVE_1 RELATIVE_ROLE_1 PHONE_1_1 PHONE_2_1 ID_RELATIVE_2 \
0 100 100 self 111111 222222 150
1 200 200 self 123456 456789 250
RELATIVE_ROLE_2 PHONE_1_2 PHONE_2_2 ID_RELATIVE_3 RELATIVE_ROLE_3 PHONE_1_3 \
0 father 333333 444444 190 mother 555555
1 father 987654 NaN 290 mother 778899
PHONE_2_3
0 NaN
1 909090
If you need the digits in the PHONE column names the other way round, replace the last two steps above (the flattening of the column names and reset_index) with:
df.columns = [f'{a.split("_")[0]}_{b}_{a.split("_")[1]}'
if 'PHONE' in a
else f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
ID ID_RELATIVE_1 RELATIVE_ROLE_1 PHONE_1_1 PHONE_1_2 ID_RELATIVE_2 \
0 100 100 self 111111 222222 150
1 200 200 self 123456 456789 250
RELATIVE_ROLE_2 PHONE_2_1 PHONE_2_2 ID_RELATIVE_3 RELATIVE_ROLE_3 PHONE_3_1 \
0 father 333333 444444 190 mother 555555
1 father 987654 NaN 290 mother 778899
PHONE_3_2
0 NaN
1 909090
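As a quick sanity check against df_want from the question (a sketch: df_want encodes the missing phone numbers as the string 'nan', so normalise before comparing):
# df_want stores missing values as the string 'nan'; fill NaN the same way before comparing
check = df.fillna('nan')
print(check.columns.tolist() == df_want.columns.tolist())
print(check.equals(df_want))
Both prints should come out True if everything lines up.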

Related

Pandas - Group by multiple columns and datetime

I have a df of tennis results and I would like to be able to see how many days it's been since each player last won a match.
This is what my df looks like
Player 1  Player 2  Date        p1_win  p2_win
Murray    Nadal     2022-05-16  1       0
Nadal     Murray    2022-05-25  1       0
and this is what I want it to look like
Player 1  Player 2  Date        p1_win  p2_win  p1_lastwin  p2_lastwin
Murray    Nadal     2022-05-16  1       0       na          na
Nadal     Murray    2022-05-25  1       0       na          9
The result needs to include the days since each player's last win, whether that player appeared as player 1 or player 2 (using groupby, I think). Also, if possible, it would be good to have a win percentage for the year.
Any help is much appreciated.
Edit: here is the dict
{'Player 1': {0: 'Murray',
1: 'Nadal',
2: 'Murray',
3: 'Nadal',
4: 'Murray',
5: 'Nadal',
6: 'Murray',
7: 'Nadal',
8: 'Murray',
9: 'Nadal',
10: 'Murray'},
'Player 2': {0: 'Nadal',
1: 'Murray',
2: 'Nadal',
3: 'Murray',
4: 'Nadal',
5: 'Murray',
6: 'Nadal',
7: 'Murray',
8: 'Nadal',
9: 'Murray',
10: 'Nadal'},
'Date': {0: '2022-05-16',
1: '2022-05-26',
2: '2022-05-27',
3: '2022-05-28',
4: '2022-05-29',
5: '2022-06-01',
6: '2022-06-02',
7: '2022-06-05',
8: '2022-06-09',
9: '2022-06-13',
10: '2022-06-17'},
'p1_win': {0: '1',
1: '1',
2: '0',
3: '1',
4: '0',
5: '0',
6: '1',
7: '0',
8: '1',
9: '0',
10: '1'},
'p2_win': {0: '0',
1: '0',
2: '1',
3: '0',
4: '1',
5: '1',
6: '0',
7: '1',
8: '0',
9: '1',
10: '0'}}
Thanks :)
I leveraged pd.merge_asof to find the latest win, and then merged back on the relevant index.
df = pd.DataFrame({'Player 1': {0: 'Murray', 1: 'Nadal', 2: 'Murray', 3: 'Nadal', 4: 'Murray', 5: 'Nadal', 6: 'Murray'}, 'Player 2': {0: 'Nadal', 1: 'Murray', 2: 'Nadal', 3: 'Murray', 4: 'Nadal', 5: 'Murray', 6: 'Nadal'}, 'Date': {0: '2022-05-16', 1: '2022-05-26', 2: '2022-05-27', 3: '2022-05-28', 4: '2022-05-29', 5: '2022-06-01', 6: '2022-06-02'}, 'p1_win': {0: '1', 1: '1', 2: '0', 3: '1', 4: '0', 5: '0', 6: '1'}, 'p2_win': {0: '0', 1: '0', 2: '1', 3: '0', 4: '1', 5: '1', 6: '0'}})
# make the win flags numeric and the dates datetimes
df['p1_win'] = df.p1_win.astype(int)
df['p2_win'] = df.p2_win.astype(int)
df['Date'] = pd.to_datetime(df['Date'])
# canonical key for a pairing, independent of who is player 1 or 2
df['match'] = [x + '_' + y if x > y else y + '_' + x for x, y in zip(df['Player 1'], df['Player 2'])]
# df['winner'] = np.where(df.p1_win==1, df['Player 1'], df['Player 2'])
# df['loser'] = np.where(df.p1_win==0, df['Player 1'], df['Player 2'])
df = df.reset_index()
df = df.sort_values(by='Date')
# for each row, find the index of the previous win in the same pairing
df = pd.merge_asof(df, df[df.p1_win==1][['match','Date','index']], by=['match'], on='Date', suffixes=('','_latest_win_p1'), allow_exact_matches=False, direction='backward')
df = pd.merge_asof(df, df[df.p2_win==1][['match','Date','index']], by=['match'], on='Date', suffixes=('','_latest_win_p2'), allow_exact_matches=False, direction='backward')
# df = df[['index','Date','Player 1','Player 2','p1_win','p2_win','match','winner','loser','index_latest_win_p2','index_latest_win_p1']]
# pull the date of that previous win back onto the row
df = df.merge(df[['Date','index','match']], how='left', left_on=['index_latest_win_p1','match'], right_on=['index','match'], suffixes=('','_latest_win_winner'))
df = df.merge(df[['Date','index','match']], how='left', left_on=['index_latest_win_p2','match'], right_on=['index','match'], suffixes=('','_latest_win_loser'))
df['days_since_last_win_winner'] = (df['Date'] - df.Date_latest_win_winner).dt.days
df['days_since_last_win_loser'] = (df['Date'] - df.Date_latest_win_loser).dt.days
Not sure this is exactly what you meant, but let me know if you need anything else.
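The yearly win percentage the question also asks about is not covered above. A minimal sketch, assuming the df prepared at the top of this answer (numeric win flags, datetime dates); long and win_pct are just illustrative names:
# reshape to one row per player per match, then average the win flags per player and year
long = pd.concat([
    df[['Date', 'Player 1', 'p1_win']].rename(columns={'Player 1': 'player', 'p1_win': 'win'}),
    df[['Date', 'Player 2', 'p2_win']].rename(columns={'Player 2': 'player', 'p2_win': 'win'}),
], ignore_index=True)
long['year'] = long['Date'].dt.year
win_pct = long.groupby(['player', 'year'])['win'].mean().mul(100)
print(win_pct)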

Convert column to date format

I am trying to convert the date to a correct date format. I have tested some of the possibilities I have read about in the forum, but I still don't know how to tackle this issue.
After importing:
df = pd.read_excel(r'/path/df_datetime.xlsb', sheet_name="12FEB22", engine='pyxlsb')
I get the following df:
{'Unnamed: 0': {0: 'Administrative ID',
1: '000002191',
2: '000002382',
3: '000002434',
4: '000002728',
5: '000002826',
6: '000003265',
7: '000004106',
8: '000004333'},
'Unnamed: 1': {0: 'Service',
1: 'generic',
2: 'generic',
3: 'generic',
4: 'generic',
5: 'generic',
6: 'generic',
7: 'generic',
8: 'generic'},
'Unnamed: 2': {0: 'Movement type',
1: 'New',
2: 'New',
3: 'New',
4: 'Modify',
5: 'New',
6: 'New',
7: 'New',
8: 'New'},
'Unnamed: 3': {0: 'Date',
1: 37503,
2: 37475,
3: 37453,
4: 44186,
5: 37711,
6: 37658,
7: 37770,
8: 37820},
'Unnamed: 4': {0: 'Contract Term',
1: '12',
2: '12',
3: '12',
4: '12',
5: '12',
6: '12',
7: '12',
8: '12'}}
However, although I have tried to convert the 'Date' column (or 'Unnamed: 3', since the original dataset has no header row and I have to fix the header afterwards) during the import, it has been unsuccessful.
Is there anything I can do?
Thanks!
try this:
from xlrd import xldate_as_datetime

def trans_date(x):
    # Excel stores dates as day serial numbers; convert only the integer cells
    if isinstance(x, int):
        return xldate_as_datetime(x, 0).date()
    else:
        return x

print(df['Unnamed: 3'].apply(trans_date))
>>>
0 Date
1 2002-09-04
2 2002-08-07
3 2002-07-16
4 2020-12-21
5 2003-03-31
6 2003-02-06
7 2003-05-29
8 2003-07-18
Name: Unnamed: 3, dtype: object
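For what it's worth, the same conversion can be done in a vectorised way by treating the values as Excel day serials (a sketch: 1899-12-30 is the origin of Excel's 1900 date system; note that, unlike the apply above, errors='coerce' turns the stray 'Date' header cell into NaT instead of keeping it):
# convert Excel day serials; non-numeric cells become NaT instead of raising
serial = pd.to_numeric(df['Unnamed: 3'], errors='coerce')
df['Unnamed: 3'] = pd.to_datetime(serial, unit='D', origin='1899-12-30')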

How to make a nested Dictionary from a Pandas data frame using several columns?

I am trying to create a nested dictionary from a pandas data frame.
I have this data frame:
id1 ids1 Name1 Name2 ids2 ID col1 Goal col2 col3
0 85643 234,34,11223,345,345_2 aasd1 vaasd1 2234,354,223,35,3435 G-0001 1 NaN 3 1
1 85644 2343,355,121,34 aasd2 G-0002 2 56.0000 4 22
2 8564312 24 , 23 ,244 ,2421 ,567 ,789 aabsd1 G-00023 3 NaN 32 33
3 8564314 87 ,35 ,67_1 aabsd2 averabsd 387 ,355 ,667_1 G-01034 4 89.0000 43 44
df.to_dict()
#Here is what you requested
{'id1 ': {0: 85643, 1: 85644, 2: 8564312, 3: 8564314},
'ids1 ': {0: '234,34,11223,345,345_2 ',
1: '2343,355,121,34 ',
2: '24 , 23 ,244 ,2421 ,567 ,789',
3: '87 ,35 ,67_1 '},
'Name1': {0: 'aasd1 ', 1: 'aasd2 ', 2: 'aabsd1', 3: 'aabsd2'},
'Name2': {0: 'vaasd1 ', 1: ' ', 2: ' ', 3: 'averabsd'},
'ids2': {0: '2234,354,223,35,3435',
1: ' ',
2: ' ',
3: ' 387 ,355 ,667_1 '},
'ID': {0: 'G-0001 ', 1: 'G-0002 ', 2: 'G-00023', 3: 'G-01034'},
'col1': {0: 1, 1: 2, 2: 3, 3: 4},
'Goal ': {0: ' NaN ', 1: 56, 2: ' NaN ', 3: 89},
'col2': {0: 3, 1: 4, 2: 32, 3: 43},
'col3': {0: 1, 1: 22, 2: 33, 3: 44}}
Each row in the "ID" column needs to be the key. Inside that dictionary, the 'Name1' and 'Name2' columns need to appear as lists: the 'Name1' list is associated with the "ids1" column and the 'Name2' list with the "ids2" column.
I also need to put the "ID" value inside each list too.
So I want to create a nested dictionary-like below.
mapper={
"G-0001":{"aasd1":['G-0001','234','34','11223','345','345_2'],
"vaasd1":['G-0001','2234','354','223','35','3435']},
"G-0002":{"aasd2":['G-0002','2343','355','121','34']},
"G-00023":{"aabsd1":['G-00023','24' , '23' ,'244' ,'2421' ,'567' ,'789']},
"G-01034":{"aabsd2":['G-01034','87' ,'35' ,'67_1'],
"averabsd":['G-01034','387' ,'355' ,'667_1']}
}
Is it possible to create that?
Can someone give me an idea, please?
Anything is appreciated. Thanks in advance!
Try:
1. Convert the DataFrame from wide to long format.
2. Drop rows without a "Name" and append "ID" to "ids".
3. groupby and construct the required output dictionary.
#remove extra spaces from column names
df.columns = df.columns.str.strip()
#assign an index and convert the DataFrame from wide to long format
df["idx"] = df.index
wtl = pd.wide_to_long(df, ["Name", "ids"], "idx", "j")
#drop rows without Name
wtl = wtl[wtl["Name"].str.strip().str.len().gt(0)]
#append ID and clean up the ids column
wtl["ids"] = wtl["ID"] + "," + wtl["ids"]
wtl["ids"] = wtl["ids"].str.split(r"\s?,\s?")
#groupby and construct the required dictionary
output = wtl.groupby("ID").apply(lambda x: dict(zip(x["Name"], x["ids"]))).to_dict()
>>> output
{'G-0001': {'aasd1': ['G-0001', '234', '34', '11223', '345', '345_2'],
'vaasd1': ['G-0001', '2234', '354', '223', '35', '3435']},
'G-0002': {'aasd2': ['G-0002', '2343', '355', '121', '34']},
'G-00023': {'aabsd1': ['G-00023', '24', '23', '244', '2421', '567', '789']},
'G-01034': {'aabsd2': ['G-01034', '87', '35', '67_1'],
'averabsd': ['G-01034', '387', '355', '667_1']}}
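The nested values can then be looked up directly:
print(output['G-0001']['aasd1'])
#['G-0001', '234', '34', '11223', '345', '345_2']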

How to perform string operation on an entire Pandas MultiIndex?

I have a pandas dataframe with a two-level column index. It's read in from a spreadsheet where the author used a lot of whitespace to accomplish things like alignment (for example, one column is called 'Tank #').
I've been able to remove the whitespace on the levels individually...
level0 = raw.columns.levels[0].str.replace(r'\s', '', regex=True)
level1 = raw.columns.levels[1].str.replace(r'\s', '', regex=True)
raw.columns.set_levels([level0, level1], inplace=True)
...but I'm curious if there is a way to do it without having to change each individual level one at a time.
I tried raw.columns.set_levels(raw.columns.str.replace(r'\s', '', regex=True))
but got AttributeError: Can only use .str accessor with Index, not MultiIndex.
Here is a small sample subset of the data (my best attempt at SO table formatting :D), followed by a picture where I've highlighted in yellow the indices as received.
   Run Info  Run Info  Run Data   Run Data
   run #     Tank #    Step A pH  conc. %
0  6931      5         5.29       33.14
1  6932      1         5.28       33.13
2  6933      2         5.32       33.40
3  6934      3         5.19       32.98
Thanks for any insight!
Edit: adding to_dict()
df.to_dict()
Out[5]:
{'Unnamed: 0': {0: nan, 1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0, 5: 4.0},
'Run Info': {0: 'run #',
1: '6931',
2: '6932',
3: '6933',
4: '6934',
5: '6935'},
'Run Info.1': {0: 'Tank #',
1: '5',
2: '1',
3: '2',
4: '3',
5: '4'},
'Run Data': {0: 'Step A\npH',
1: '5.29',
2: '5.28',
3: '5.32',
4: '5.19',
5: '5.28'},
'Run Data.1': {0: 'concentration',
1: '33.14',
2: '33.13',
3: '33.4',
4: '32.98',
5: '32.7'}}
How about rename:
import re
df.rename(columns=lambda x: re.sub(r'\s+', ' ', x.strip()), inplace=True)
If you don't want to keep any of the spaces, you can just replace ' ' with ''.
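If you'd rather repair the MultiIndex itself in one pass, set_levels also accepts a list with one array per level; a sketch, assuming raw as in the question:
#clean every level at once instead of one set_levels call per level
raw.columns = raw.columns.set_levels(
    [lvl.str.replace(r'\s+', ' ', regex=True).str.strip() for lvl in raw.columns.levels])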

How to convert if/else to np.where in pandas

My code is below. I apply pd.to_numeric to the columns that are supposed to be int or float but come in as object. Can we make this more pandas-like, e.g. by applying np.where?
if df.dtypes.all() == 'object':
    df = df.apply(pd.to_numeric, errors='coerce').fillna(df)
else:
    df = df
A simple one-liner is assign with select_dtypes, which will reassign the existing columns:
df.assign(**df.select_dtypes('O').apply(pd.to_numeric,errors='coerce').fillna(df))
np.where:
df[:] = np.where(df.dtypes == 'object',
                 df.apply(pd.to_numeric, errors='coerce').fillna(df), df)
Example (check the Price column):
d = {'CusID': {0: 1, 1: 2, 2: 3},
'Name': {0: 'Paul', 1: 'Mark', 2: 'Bill'},
'Shop': {0: 'Pascal', 1: 'Casio', 2: 'Nike'},
'Price': {0: '24000', 1: 'a', 2: '900'}}
df = pd.DataFrame(d)
print(df)
CusID Name Shop Price
0 1 Paul Pascal 24000
1 2 Mark Casio a
2 3 Bill Nike 900
df.to_dict()
{'CusID': {0: 1, 1: 2, 2: 3},
'Name': {0: 'Paul', 1: 'Mark', 2: 'Bill'},
'Shop': {0: 'Pascal', 1: 'Casio', 2: 'Nike'},
'Price': {0: '24000', 1: 'a', 2: '900'}}
(df.assign(**df.select_dtypes('O').apply(pd.to_numeric,errors='coerce')
.fillna(df)).to_dict())
{'CusID': {0: 1, 1: 2, 2: 3},
'Name': {0: 'Paul', 1: 'Mark', 2: 'Bill'},
'Shop': {0: 'Pascal', 1: 'Casio', 2: 'Nike'},
'Price': {0: 24000.0, 1: 'a', 2: 900.0}}
The equivalent of your if/else is df.mask:
df_out = df.mask(df.dtypes == 'O',
                 df.apply(pd.to_numeric, errors='coerce').fillna(df))
