I have 2 dataframes:
df1 = pd.DataFrame.from_dict({('category', ''): {0: 'A',
1: 'B',
2: 'C',
3: 'D',
4: 'E',
5: 'F',
6: 'G'},
(pd.Timestamp('2021-06-28 00:00:00'),
'metric_1'): {0: 4120.549999999999, 1: 11226.016666666665, 2: 25049.443333333333, 3: 18261.083333333332, 4: 2553.1208333333334, 5: 2843.01, 6: 73203.51333333334},
(pd.Timestamp('2021-06-28 00:00:00'), 'metric_2'): {0: 9907.79,
1: 7614.650000000001,
2: 13775.259999999998,
3: 13158.250000000004,
4: 1457.85,
5: 1089.5600000000002,
6: 38864.9},
(pd.Timestamp('2021-07-05 00:00:00'),
'metric_1'): {0: 5817.319999999998, 1: 10799.45, 2: 23521.51, 3: 22062.350833333334, 4: 1249.5974999999999, 5: 3229.77, 6: 52796.06083333332},
(pd.Timestamp('2021-07-05 00:00:00'), 'metric_2'): {0: 6321.21,
1: 5606.01,
2: 10239.689999999999,
3: 17476.600000000002,
4: 943.7199999999999,
5: 1410.33,
6: 29645.45}}).set_index('category')
df2 = pd.DataFrame.from_dict({'category': {0: 'A',
1: 'B',
2: 'C',
3: 'D',
4: 'E',
5: 'F',
6: 'G'},
1: {0: 36234.035577957984,
1: 69078.07089184562,
2: 128879.5397517309,
3: 178376.63536908248,
4: 9293.956915067887,
5: 8184.780211399392,
6: 177480.74540313095},
2: {0: 37887.581678419825,
1: 72243.67956241772,
2: 134803.02342121338,
3: 186603.8963173654,
4: 9716.385738295368,
5: 8555.606693927,
6: 185658.87577993725}}).set_index('category')
First I change the column names of df2 to be the same as df1:
date_mappings = {
1 : '2021-06-28',
2 : '2021-07-05'}
df2 = df2.rename(columns=date_mappings)
Then I try to merge it
f = lambda x: pd.to_datetime(x)
df = (df2.merge(df1.unstack(), left_index=True, right_index=True).sort_index(axis=1))
But I get an error:
ValueError: Cannot merge a Series without a name
What is my mistake?
My goal is to add the df2 column to df1 under each week, so that each week in df1 has three columns instead of two.
After using
c = [df2.columns.map(date_mappings.get), df2.columns]
df1.join(df2.set_axis(c, axis=1)).sort_index(axis=1)
I get the values appended to the end of the dataframe rather than aligned under the matching week columns:
Maybe the issue is that df2 holds dates from 2021-06-28 to 2022-06-27 while df1 holds dates from 2020 to today.
The idea is to create a MultiIndex in both DataFrames:
date_mappings = {
1 : '2021-06-28',
2 : '2021-07-05'}
#create MultiIndex in df2 with datetimes in first level
df2.columns = pd.MultiIndex.from_product([pd.to_datetime(df2.columns.map(date_mappings)),
['metric_3']])
#remove unused levels (here 'category') so the first level can be converted to datetimes
df1.columns = df1.columns.remove_unused_levels()
df1.columns = df1.columns.set_levels(pd.to_datetime(df1.columns.levels[0]), level=0)
#join together and sorting MultiIndex
df = df1.join(df2).sort_index(axis=1)
print (df)
2021-06-28 2021-07-05 \
metric_1 metric_2 metric_3 metric_1 metric_2
category
A 4120.550000 9907.79 36234.035578 5817.320000 6321.21
B 11226.016667 7614.65 69078.070892 10799.450000 5606.01
C 25049.443333 13775.26 128879.539752 23521.510000 10239.69
D 18261.083333 13158.25 178376.635369 22062.350833 17476.60
E 2553.120833 1457.85 9293.956915 1249.597500 943.72
F 2843.010000 1089.56 8184.780211 3229.770000 1410.33
G 73203.513333 38864.90 177480.745403 52796.060833 29645.45
metric_3
category
A 37887.581678
B 72243.679562
C 134803.023421
D 186603.896317
E 9716.385738
F 8555.606694
G 185658.875780
If you need to remove datetimes greater than the maximal df1 datetime, use:
#change mapping for test
date_mappings = {
1 : '2021-06-28',
2 : '2022-07-05'}
df2.columns = pd.MultiIndex.from_product([pd.to_datetime(df2.columns.map(date_mappings)),
['metric_3']])
df1.columns = df1.columns.remove_unused_levels()
df1.columns = df1.columns.set_levels(pd.to_datetime(df1.columns.levels[0]), level=0)
df2 = df2.loc[:, df2.columns.get_level_values(0) <= df1.columns.get_level_values(0).max()]
print (df2)
2021-06-28
metric_3
category
A 36234.035578
B 69078.070892
C 128879.539752
D 178376.635369
E 9293.956915
F 8184.780211
G 177480.745403
#join together and sorting MultiIndex
df = df1.join(df2).sort_index(axis=1)
print (df)
2021-06-28 2021-07-05
metric_1 metric_2 metric_3 metric_1 metric_2
category
A 4120.550000 9907.79 36234.035578 5817.320000 6321.21
B 11226.016667 7614.65 69078.070892 10799.450000 5606.01
C 25049.443333 13775.26 128879.539752 23521.510000 10239.69
D 18261.083333 13158.25 178376.635369 22062.350833 17476.60
E 2553.120833 1457.85 9293.956915 1249.597500 943.72
F 2843.010000 1089.56 8184.780211 3229.770000 1410.33
G 73203.513333 38864.90 177480.745403 52796.060833 29645.45
Use pd.DataFrame.reindex + pd.DataFrame.join
reindex has a convenient level parameter that lets you expand onto index levels that are not present.
df1.join(df2.reindex(df1.index, level=0))
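For instance, on a pair of small made-up frames (names here are hypothetical, not from the question), the broadcasting looks like this:

```python
import pandas as pd

# `left` has a two-level row index; `right` only carries the first level.
left = pd.DataFrame(
    {'x': [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['grp', 'n']),
)
right = pd.DataFrame({'y': [10, 20]}, index=pd.Index(['a', 'b'], name='grp'))

# reindex(..., level=0) broadcasts `right` across the missing inner level,
# so the join lines up row for row.
out = left.join(right.reindex(left.index, level=0))
```

Each value of `y` is repeated for every inner-level row of its group.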
I am not sure if this is what you want, but you might need to_frame:
df = (df2.merge(df1.unstack().to_frame(), left_index=True, right_index=True).sort_index(level=0))
print(df)
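For context, the ValueError itself only says that df1.unstack() produced an anonymous Series; either to_frame (as above) or giving the Series a name makes merge accept it. A minimal sketch with made-up data:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
s = pd.Series([10, 20], index=['x', 'y'])  # anonymous Series, like df1.unstack()

# left.merge(s, left_index=True, right_index=True)
#   -> ValueError: Cannot merge a Series without a name

# Naming the Series gives merge a column name to use:
merged = left.merge(s.rename('b'), left_index=True, right_index=True)
```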
Related
If an index column value contains a ';', retain only the substring after the ';'; else retain it as-is. It would be even better as a list comprehension.
My code raised ValueError: Length of values (4402) does not match length of index (22501).
# If gene name is separated by ";", then get the substring after the ";"
list = []
for i in meth["Name"]:
    if ";" in i:
        list.append(i.split(";", 1)[1])
    else:
        continue
meth["Name"] = list
Traceback:
--> 532 "Length of values "
533 f"({len(data)}) "
534 "does not match length of index "
ValueError: Length of values (4402) does not match length of index (22501)
Sample data:
meth.iloc[0:4,0:4].to_dict()
{'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
'TCGA-2K-A9WE-01A': {0: 0.27891582736223297,
1: 0.786837244239289,
2: 0.5310546143038515,
3: 0.7119161837613309},
'TCGA-2Z-A9J1-01A': {0: 0.318496987871566,
1: 0.386177267500376,
2: 0.5086236274690276,
3: 0.4036012750884792},
'TCGA-2Z-A9J2-01A': {0: 0.400119915667055,
1: 0.54983504208745,
2: 0.5352071929258406,
3: 0.6139719037555759}}
Are you trying to perform this operation on the column names, or to values of a specific column?
Either way, I think this will do the job:
import pandas as pd
# Define the example dataframe
df = pd.DataFrame(
{
'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
'TCGA-2K-A9WE-01A': {0: 0.27891582736223297, 1: 0.786837244239289,
2: 0.5310546143038515, 3: 0.7119161837613309},
'TCGA-2Z-A9J1-01A': {0: 0.318496987871566, 1: 0.386177267500376,
2: 0.5086236274690276, 3: 0.4036012750884792},
'TCGA-2Z-A9J2-01A': {0: 0.400119915667055, 1: 0.54983504208745,
2: 0.5352071929258406, 3: 0.6139719037555759}
}
)
# Original dataframe:
df
"""
index TCGA-2K-A9WE-01A TCGA-2Z-A9J1-01A TCGA-2Z-A9J2-01A
0 A1BG 0.278916 0.318497 0.400120
1 A1CF 0.786837 0.386177 0.549835
2 A2BP1 0.531055 0.508624 0.535207
3 A2LD1 0.711916 0.403601 0.613972
"""
# Replacing column names, using '-' as separator:
df.columns = df.columns.astype(str).str.split('-').str[-1]
# Modified dataframe:
df
"""
index 01A 01A 01A
0 A1BG 0.278916 0.318497 0.400120
1 A1CF 0.786837 0.386177 0.549835
2 A2BP1 0.531055 0.508624 0.535207
3 A2LD1 0.711916 0.403601 0.613972
"""
You can apply the same logic to your dataframe index, or specific columns:
df = pd.DataFrame(
{
'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
'TCGA-2K-A9WE-01A': {0: 0.27891582736223297, 1: 0.786837244239289,
2: 0.5310546143038515, 3: 0.7119161837613309},
'TCGA-2Z-A9J1-01A': {0: 0.318496987871566, 1: 0.386177267500376,
2: 0.5086236274690276, 3: 0.4036012750884792},
'TCGA-2Z-A9J2-01A': {0: 0.400119915667055, 1: 0.54983504208745,
2: 0.5352071929258406, 3: 0.6139719037555759},
'name': {0: 'A;B;C', 1: 'AAA', 2: 'BBB', 3: 'C-DE'}
}
)
# Original dataframe:
df
"""
index TCGA-2K-A9WE-01A TCGA-2Z-A9J1-01A TCGA-2Z-A9J2-01A name
0 A1BG 0.278916 0.318497 0.400120 A;B;C
1 A1CF 0.786837 0.386177 0.549835 AAA
2 A2BP1 0.531055 0.508624 0.535207 BBB
3 A2LD1 0.711916 0.403601 0.613972 C-DE
"""
# Splitting values from column "name":
df['name'] = df['name'].astype(str).str.split(';').str[-1]
df
"""
index TCGA-2K-A9WE-01A TCGA-2Z-A9J1-01A TCGA-2Z-A9J2-01A name
0 A1BG 0.278916 0.318497 0.400120 C
1 A1CF 0.786837 0.386177 0.549835 AAA
2 A2BP1 0.531055 0.508624 0.535207 BBB
3 A2LD1 0.711916 0.403601 0.613972 C-DE
"""
Note
Please note that if the column values hold multiple repetitions of the same separator (e.g.: "A;B;C"), only the last substring gets returned (for "A;B;C", returns "C").
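Since the question also asked for a list comprehension over the values: a conditional comprehension keeps the row count intact (the original loop skipped rows without ';', which is what caused the length mismatch). A sketch, assuming a 'Name' column with sample values:

```python
import pandas as pd

meth = pd.DataFrame({'Name': ['A1BG', 'X;Y', 'A;B;C', 'A2LD1']})

# Keep the substring after the first ';' when present; otherwise keep the value as-is.
meth['Name'] = [s.split(';', 1)[1] if ';' in s else s for s in meth['Name']]
```

Every input row produces exactly one output value, so the assignment back to the column cannot raise the length ValueError.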
I am trying to take values from one data frame and make them into another. It is hard to explain, but I have made an example below. Can someone please help, as I have lots of columns I would like to reduce to a few. I want to end up with the matrix pd.concat([df1, df2]) derived from df.
PRE is a factor with 2 levels (0 for POST, 1 for PRE); SPC is a factor with many levels.
Thank you
df = pd.DataFrame({'CPDID': {0: 'C1', 1: 'C2', 2: 'C3'},
'Rate': {0: 100, 1: 500, 2: 200},
'PRE_SPC1': {0: 'NaN', 1: 'NaN', 2: 'NaN'},
'POST_SPC2': {0:10, 1:50, 2:80},
'POST_SPC3': {0:30, 1:40, 2:10}})
df1 = pd.DataFrame({'CPDID':{0: 'C1', 1: 'C2', 2: 'C3'},
'Rate': {0: 100, 1: 500, 2: 200},
'PRE': {0: 1, 1: 1, 2: 1},
'SPC': {0:1, 1:1, 2:1},
'Damage': {0:'NaN', 1:'NaN', 2:'NaN'}})
df2 = pd.DataFrame({'CPDID':{0: 'C1', 1: 'C2', 2: 'C2'},
'Rate': {0: 100, 1: 500, 2: 200},
'PRE': {0: 0, 1: 0, 2: 0},
'SPC': {0:2, 1:2, 2:2},
'Damage': {0:10, 1:50, 2:80}})
print(df)
print(pd.concat([df1,df2]))
The core step is to transform the dataframe by .stack(). However, your desired dataframe requires quite a few steps to transform and extract column label values from the base df, as follows:
df = pd.DataFrame({'CPDID': {0: 'C1', 1: 'C2', 2: 'C3'},
'Rate': {0: 100, 1: 500, 2: 200},
'PRE_SPC1': {0: 'NaN', 1: 'NaN', 2: 'NaN'},
'POST_SPC2': {0:10, 1:50, 2:80},
'POST_SPC3': {0:30, 1:40, 2:10}})
df_out = df.set_index(['CPDID', 'Rate'])
# split 'PRE'/'POST' from 'SPCn' from column labels
df_out.columns = df_out.columns.str.split('_', expand=True)
# name the column levels 'PRE' and 'SPC'
df_out = df_out.rename_axis(('PRE', 'SPC'), axis=1)
# main step to transform df by stacking and name the values as 'Damage'
df_out = df_out.stack(level=[0,1]).reset_index(name='Damage')
# transform the values of 'PRE'
df_out['PRE'] = df_out['PRE'].eq('PRE').astype(int)
# extract number from 'SPCn'
df_out['SPC'] = df_out['SPC'].str.extract(r'(\d)$')
# sort to the required sequence
df_out = df_out.sort_values('SPC', ignore_index=True)
Result:
print(df_out)
CPDID Rate PRE SPC Damage
0 C1 100 1 1 NaN
1 C2 500 1 1 NaN
2 C3 200 1 1 NaN
3 C1 100 0 2 10.0
4 C2 500 0 2 50.0
5 C3 200 0 2 80.0
6 C1 100 0 3 30.0
7 C2 500 0 3 40.0
8 C3 200 0 3 10.0
I have 2 dataframes
df1
product_id value name
abc 10 a
def 20 b
ggg 10 c
df2
Which I get after using df2.groupby(['prod_id'])['code'].count().reset_index()
prod_id code
abc 10
def 20
ggg 10
ooo 5
hhh 1
I want to merge values from df2 to df1 left on product_id, right on prod_id.
To get:
product_id value name
abc 20 a
def 40 b
ggg 20 c
I tried:
pd.merge(df1, df2.groupby(['prod_id'])['code'].count().reset_index(),
left_on='product_id', right_on='prod_id', how='left')
Which returns df1 with two additional columns, prod_id and code, where code holds the amount by which I would like to increase value in df1. I could now just add those two columns together, but I would like to avoid that.
Here’s one alternative:
df1['value'] = df1.product_id.map(dict(df2.values)).fillna(0).add(df1.value)
Complete example:
df1 = pd.DataFrame({'product_id': {0: 'abc', 1: 'def', 2: 'ggg'},
'value': {0: 10, 1: 20, 2: 10},
'name': {0: 'a', 1: 'b', 2: 'c'}})
df2 = pd.DataFrame({'prod_id': {0: 'abc', 1: 'def', 2: 'ggg', 3: 'ooo', 4: 'hhh'},
'code': {0: 10, 1: 20, 2: 10, 3: 5, 4: 1}})
df1['value'] = df1.product_id.map(dict(df2.values)).fillna(0).add(df1.value)
OUTPUT:
product_id value name
0 abc 20 a
1 def 40 b
2 ggg 20 c
You could use reindex on the groupby count (without the reset_index), ordered by df1's product_id, like:
df1['value'] += (
df2.groupby(['prod_id'])
['code'].count()
.reindex(df1['product_id'], fill_value=0)
.to_numpy()
)
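A complete runnable version of this approach, with a made-up raw df2 whose per-product row counts (10, 20, 10, 5, 1) match the counts shown in the question:

```python
import pandas as pd

df1 = pd.DataFrame({'product_id': ['abc', 'def', 'ggg'],
                    'value': [10, 20, 10],
                    'name': ['a', 'b', 'c']})

# Hypothetical raw df2 before the groupby-count: one row per code occurrence.
df2_raw = pd.DataFrame({'prod_id': ['abc'] * 10 + ['def'] * 20 + ['ggg'] * 10
                                   + ['ooo'] * 5 + ['hhh'],
                        'code': 1})

# Count per product, align to df1's order, and add in place.
df1['value'] += (
    df2_raw.groupby('prod_id')['code'].count()
           .reindex(df1['product_id'], fill_value=0)
           .to_numpy()
)
```

Products present only in df2 ('ooo', 'hhh') are simply ignored by the reindex, and fill_value=0 covers products missing from df2.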
I have a multilevel dataframe and I want to compare the value in column secret with a condition on column group. If group = A, we allow the value in another dataframe to be empty or na. Otherwise, output INVALID for the mismatching ones.
multilevel dataframe:
name secret group
df1 df2 df1 df2 df1 df2
id
1 Tim Tim random na A A
2 Tom Tom tree A A
3 Alex Alex apple apple B B
4 May May file cheese C C
expected output for secret
id name secret group
1 Tim na A
2 Tom A
3 Alex apple B
4 May INVALID C
so far I have:
result_df['result'] = multilevel_df.groupby(level=0, axis=0).apply(lambda x: secret_check(x))
#take care of the rest by compare column by column
result_df = multilevel_df.groupby(level=0, axis=1).apply(lambda x: validate(x))
def validate(x):
    if x[0] == x[1]:
        return x[1]
    else:
        return 'INVALID'

def secret_check(x):
    if x['group'] == 'A' and pd.isnull(x['secret']):  # this line is off
        return x[1]
    elif x[0] == x[1]:
        return x[1]
    else:
        return 'INVALID'
Assuming we have the following dataframe:
df = pd.DataFrame({0: {0: 1, 1: 2, 2: 3, 3: 4},
1: {0: 'Tim', 1: 'Tom', 2: 'Alex', 3: 'May'},
2: {0: 'Tim', 1: 'Tom', 2: 'Alex', 3: 'May'},
3: {0: 'random', 1: 'tree', 2: 'apple', 3: 'file'},
4: {0: 'na', 1: '', 2: 'apple', 3: 'cheese'},
5: {0: 'A', 1: 'A', 2: 'B', 3: 'C'},
6: {0: 'A', 1: 'A', 2: 'B', 3: 'C'}})
df
df.columns = pd.MultiIndex.from_tuples([('id',''), ('name', 'df1'), ('name', 'df2'),
('secret', 'df1'), ('secret', 'df2'), ('group', 'df1'), ('group', 'df2')])
df
In[1]:
id name secret group
df1 df2 df1 df2 df1 df2
0 1 Tim Tim random na A A
1 2 Tom Tom tree A A
2 3 Alex Alex apple apple B B
3 4 May May file cheese C C
You can use np.select() to return results based on conditions, .droplevel() to flatten the MultiIndex, and df.loc[:, ~df.columns.duplicated()] to drop duplicate columns. Since we are writing the answer into the df1 columns, the df2 columns are no longer needed.
df[('secret', 'df1')] = np.select([(df[('group', 'df2')] != 'A') &
(df[('secret', 'df1')] != df[('secret', 'df2')])], #condition 1
[df[('secret', 'df1')] + ' > ' + df[('secret', 'df2')]], #result 1
df[('secret', 'df2')]) #alternative if conditions not met
df.columns = df.columns.droplevel(1)
df = df.loc[:,~df.columns.duplicated()]
df
Out[1]:
id name secret group
0 1 Tim na A
1 2 Tom A
2 3 Alex apple B
3 4 May file > cheese C
If I understand you right, you want to mark "secret" in df2 as invalid if the secrets in df1 and df2 differ and the group is not A. There you go:
condition = (df[('secret', 'df1')] != df[('secret', 'df2')]) & \
            (df[('group', 'df1')] != 'A')
df.loc[condition, ('secret', 'df2')] = 'INVALID'
Need help converting row answers to columns in Python. Given below is a sample dataset.
Thanks for Help..
ID| date |question_id |Choice_id| answer
1 | 2020-01-01 | 471362125 |NAN | 29720950
2 | 2020-01-01 | 471362121 |311470023| 8
3 | 2020-01-01 | 471362120 |311470024| 9
4 | 2020-01-01 | 471362524 |312472025| 5
5 | 2020-01-01 | 471362122 |NAN. | Delivery Issue
Expected output
id|date|471362125_nan|471362121_311470023|471362120_311470024|471362524_312472025|471362122_NAN
1 | 2020-01-01| 29720950|8|9|5|Delivery Issue
I will later rename these with the question text, using rename in pandas.
You could do it with a brute-force technique using a lot of iloc, setting the column names, appending, and resetting the index. The main idea is that the new column names and the first row come from two existing columns, so you append those together horizontally:
input:
import pandas as pd
df = pd.DataFrame({'ID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'date': {0: '2020-01-01',
1: '2020-01-01',
2: '2020-01-01',
3: '2020-01-01',
4: '2020-01-01'},
'question_id': {0: 471362125,
1: 471362121,
2: 471362120,
3: 471362524,
4: 471362122},
'Choice_id': {0: 'NAN',
1: '311470023',
2: '311470024',
3: '312472025',
4: 'NAN.'},
'answer': {0: '29720950', 1: '8', 2: '9', 3: '5', 4: 'Delivery Issue'}})
code:
df1 = df.copy()
d = df1['date'].min()
i = df1['ID'].min()
df1.columns = (df1['question_id'].astype(str) + '_' + df1['Choice_id'].astype(str)).to_list()
b = df1.columns.to_list()
a = pd.DataFrame(df1.iloc[:,4]).T
a.columns = b
df2 = pd.concat([a, df1]).iloc[0:1]  # DataFrame.append was removed in pandas 2.0
df2 = df2.reset_index()
df2 = df2.rename({'index' : 'date'}, axis=1).reset_index()
df2 = df2.rename({'index' : 'id'}, axis=1)
df2['date'] = d
df2['id'] = i
df2.columns.names=['']
df2
output:
id date 471362125_NAN 471362121_311470023 471362120_311470024 471362524_312472025 471362122_NAN.
0 1 2020-01-01 29720950 8 9 5 Delivery Issue
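For what it's worth, a shorter route is to build the combined question_id/Choice_id labels and pivot on them. A sketch (it drops the per-row ID; one could take df['ID'].min() as in the code above, and real data with several dates would produce one row per date):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'date': ['2020-01-01'] * 5,
                   'question_id': [471362125, 471362121, 471362120, 471362524, 471362122],
                   'Choice_id': ['NAN', '311470023', '311470024', '312472025', 'NAN.'],
                   'answer': ['29720950', '8', '9', '5', 'Delivery Issue']})

# Combine question_id and Choice_id into one label, then spread answers into columns.
wide = (df.assign(col=df['question_id'].astype(str) + '_' + df['Choice_id'].astype(str))
          .pivot(index='date', columns='col', values='answer')
          .reset_index())
```

Note that pivot sorts the new columns lexicographically, so the column order differs from the expected output until reordered.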