How to manipulate this data frame in pandas (Python)

I am trying to take values from one data frame and build another. It is hard to explain exactly what I am doing, so I have made an example below. Please can someone help, as I have lots of columns that I would like to reduce to a few. Starting from df, I want to end up with the matrix pd.concat([df1, df2]).
PRE is a factor with 2 levels, 0 for POST or 1 for PRE; SPC is a factor with many levels.
Thank you
import pandas as pd

df = pd.DataFrame({'CPDID': {0: 'C1', 1: 'C2', 2: 'C3'},
                   'Rate': {0: 100, 1: 500, 2: 200},
                   'PRE_SPC1': {0: 'NaN', 1: 'NaN', 2: 'NaN'},
                   'POST_SPC2': {0: 10, 1: 50, 2: 80},
                   'POST_SPC3': {0: 30, 1: 40, 2: 10}})
df1 = pd.DataFrame({'CPDID': {0: 'C1', 1: 'C2', 2: 'C3'},
                    'Rate': {0: 100, 1: 500, 2: 200},
                    'PRE': {0: 1, 1: 1, 2: 1},
                    'SPC': {0: 1, 1: 1, 2: 1},
                    'Damage': {0: 'NaN', 1: 'NaN', 2: 'NaN'}})
df2 = pd.DataFrame({'CPDID': {0: 'C1', 1: 'C2', 2: 'C3'},
                    'Rate': {0: 100, 1: 500, 2: 200},
                    'PRE': {0: 0, 1: 0, 2: 0},
                    'SPC': {0: 2, 1: 2, 2: 2},
                    'Damage': {0: 10, 1: 50, 2: 80}})
print(df)
print(pd.concat([df1,df2]))

The core step is to transform the dataframe with .stack(). However, your desired dataframe takes quite a few steps to build, transforming the base df and extracting values from its column labels, as follows:
df = pd.DataFrame({'CPDID': {0: 'C1', 1: 'C2', 2: 'C3'},
                   'Rate': {0: 100, 1: 500, 2: 200},
                   'PRE_SPC1': {0: 'NaN', 1: 'NaN', 2: 'NaN'},
                   'POST_SPC2': {0: 10, 1: 50, 2: 80},
                   'POST_SPC3': {0: 30, 1: 40, 2: 10}})
df_out = df.set_index(['CPDID', 'Rate'])
# split 'PRE'/'POST' from 'SPCn' in the column labels
df_out.columns = df_out.columns.str.split('_', expand=True)
# name the two column levels 'PRE' and 'SPC'
df_out = df_out.rename_axis(('PRE', 'SPC'), axis=1)
# main step: stack both column levels and name the values 'Damage'
df_out = df_out.stack(level=[0, 1]).reset_index(name='Damage')
# encode 'PRE' as 1 and 'POST' as 0
df_out['PRE'] = df_out['PRE'].eq('PRE').astype(int)
# extract the trailing number from 'SPCn'
df_out['SPC'] = df_out['SPC'].str.extract(r'(\d+)$')
# sort to the required sequence
df_out = df_out.sort_values('SPC', ignore_index=True)
Result:
print(df_out)
CPDID Rate PRE SPC Damage
0 C1 100 1 1 NaN
1 C2 500 1 1 NaN
2 C3 200 1 1 NaN
3 C1 100 0 2 10.0
4 C2 500 0 2 50.0
5 C3 200 0 2 80.0
6 C1 100 0 3 30.0
7 C2 500 0 3 40.0
8 C3 200 0 3 10.0
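For comparison, the same reshape can be written with melt, which skips the intermediate MultiIndex. A minimal sketch using the same df as above (column names are split on '_' exactly as before):
out = df.melt(id_vars=['CPDID', 'Rate'], var_name='var', value_name='Damage')
out[['PRE', 'SPC']] = out['var'].str.split('_', expand=True)
out['PRE'] = out['PRE'].eq('PRE').astype(int)   # PRE -> 1, POST -> 0
out['SPC'] = out['SPC'].str.extract(r'(\d+)$')  # keep the trailing number
out = out.drop(columns='var').sort_values('SPC', ignore_index=True)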

Related

Merging two multiindex dataframes

I have 2 dataframes:
df1 = pd.DataFrame.from_dict({
    ('category', ''): {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G'},
    (pd.Timestamp('2021-06-28 00:00:00'), 'metric_1'): {0: 4120.549999999999, 1: 11226.016666666665, 2: 25049.443333333333, 3: 18261.083333333332, 4: 2553.1208333333334, 5: 2843.01, 6: 73203.51333333334},
    (pd.Timestamp('2021-06-28 00:00:00'), 'metric_2'): {0: 9907.79, 1: 7614.650000000001, 2: 13775.259999999998, 3: 13158.250000000004, 4: 1457.85, 5: 1089.5600000000002, 6: 38864.9},
    (pd.Timestamp('2021-07-05 00:00:00'), 'metric_1'): {0: 5817.319999999998, 1: 10799.45, 2: 23521.51, 3: 22062.350833333334, 4: 1249.5974999999999, 5: 3229.77, 6: 52796.06083333332},
    (pd.Timestamp('2021-07-05 00:00:00'), 'metric_2'): {0: 6321.21, 1: 5606.01, 2: 10239.689999999999, 3: 17476.600000000002, 4: 943.7199999999999, 5: 1410.33, 6: 29645.45},
}).set_index('category')
df2 = pd.DataFrame.from_dict({
    'category': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G'},
    1: {0: 36234.035577957984, 1: 69078.07089184562, 2: 128879.5397517309, 3: 178376.63536908248, 4: 9293.956915067887, 5: 8184.780211399392, 6: 177480.74540313095},
    2: {0: 37887.581678419825, 1: 72243.67956241772, 2: 134803.02342121338, 3: 186603.8963173654, 4: 9716.385738295368, 5: 8555.606693927, 6: 185658.87577993725},
}).set_index('category')
First I change the column names of df2 to match df1:
date_mappings = {
1 : '2021-06-28',
2 : '2021-07-05'}
df2 = df2.rename(columns=date_mappings)
Then I try to merge it
f = lambda x: pd.to_datetime(x)
df = (df2.merge(df1.unstack(), left_index=True, right_index=True).sort_index(axis=1))
But I get an error:
ValueError: Cannot merge a Series without a name
What is my mistake?
My goal is to add the df2 column to df1 under each week, so that df1 would have 3 columns per week instead of 2.
After using
c = [df2.columns.map(date_mappings.get), df2.columns]
df1.join(df2.set_axis(c, axis=1)).sort_index(axis=1)
I get the values appended to the end of the dataframe rather than to the same columns with the same week naming:
Maybe this could be an issue that df2 holds dates from 2021-06-28 to 2022-06-27 while df1 holds dates from 2020 to today.
The idea is to create a MultiIndex in both DataFrames:
date_mappings = {
1 : '2021-06-28',
2 : '2021-07-05'}
# create a MultiIndex in df2 with datetimes in the first level
df2.columns = pd.MultiIndex.from_product([pd.to_datetime(df2.columns.map(date_mappings)),
                                          ['metric_3']])
# remove unused levels (here 'category') so the first level can be converted to datetimes
df1.columns = df1.columns.remove_unused_levels()
df1.columns = df1.columns.set_levels(pd.to_datetime(df1.columns.levels[0]), level=0)
# join together and sort the MultiIndex
df = df1.join(df2).sort_index(axis=1)
print (df)
2021-06-28 2021-07-05 \
metric_1 metric_2 metric_3 metric_1 metric_2
category
A 4120.550000 9907.79 36234.035578 5817.320000 6321.21
B 11226.016667 7614.65 69078.070892 10799.450000 5606.01
C 25049.443333 13775.26 128879.539752 23521.510000 10239.69
D 18261.083333 13158.25 178376.635369 22062.350833 17476.60
E 2553.120833 1457.85 9293.956915 1249.597500 943.72
F 2843.010000 1089.56 8184.780211 3229.770000 1410.33
G 73203.513333 38864.90 177480.745403 52796.060833 29645.45
metric_3
category
A 37887.581678
B 72243.679562
C 134803.023421
D 186603.896317
E 9716.385738
F 8555.606694
G 185658.875780
If you need to remove datetimes greater than the maximum datetime in df1, use:
#change mapping for test
date_mappings = {
1 : '2021-06-28',
2 : '2022-07-05'}
df2.columns = pd.MultiIndex.from_product([pd.to_datetime(df2.columns.map(date_mappings)),
                                          ['metric_3']])
df1.columns = df1.columns.remove_unused_levels()
df1.columns = df1.columns.set_levels(pd.to_datetime(df1.columns.levels[0]), level=0)
df2 = df2.loc[:, df2.columns.get_level_values(0) <= df1.columns.get_level_values(0).max()]
print (df2)
2021-06-28
metric_3
category
A 36234.035578
B 69078.070892
C 128879.539752
D 178376.635369
E 9293.956915
F 8184.780211
G 177480.745403
# join together and sort the MultiIndex
df = df1.join(df2).sort_index(axis=1)
print (df)
2021-06-28 2021-07-05
metric_1 metric_2 metric_3 metric_1 metric_2
category
A 4120.550000 9907.79 36234.035578 5817.320000 6321.21
B 11226.016667 7614.65 69078.070892 10799.450000 5606.01
C 25049.443333 13775.26 128879.539752 23521.510000 10239.69
D 18261.083333 13158.25 178376.635369 22062.350833 17476.60
E 2553.120833 1457.85 9293.956915 1249.597500 943.72
F 2843.010000 1089.56 8184.780211 3229.770000 1410.33
G 73203.513333 38864.90 177480.745403 52796.060833 29645.45
Use pd.DataFrame.reindex + pd.DataFrame.join.
reindex has a convenient level parameter that lets you broadcast across index levels that are not present.
df1.join(df2.reindex(df1.index, level=0))
I am not sure if this is what you want, but you might need to_frame:
f = lambda x: pd.to_datetime(x)
df = (df2.merge(df1.unstack().to_frame(), left_index=True, right_index=True).sort_index(level=0))
print(df)
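For context on the original ValueError: df1.unstack() returns a Series without a name, and merge cannot create a column it cannot name, which is why to_frame() (or naming the Series) helps. A minimal sketch of that specific fix (the 'value' name is arbitrary):
s = df1.unstack()      # Series with a (date, metric, category) MultiIndex and no name
s = s.rename('value')  # naming it lets merge build a column from it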

Compare two data frames side by side for same columns but different rows

I have two data frames with the same column labels, like below:
df1 = pd.DataFrame({'key_1': {0: 'F', 1: 'H', 2: 'E'},
                    'key_2': {0: 'F', 1: 'G', 2: 'E'},
                    'min': {0: -158, 1: -881, 2: -674},
                    'count': {0: 58, 1: 24, 2: 13}})
df2 = pd.DataFrame({'key_1': {0: 'C', 1: 'L', 2: 'F', 3: 'K'},
                    'key_2': {0: 'C', 1: 'D', 2: 'F', 3: 'K'},
                    'min': {0: -452, 1: -153, 2: -181, 3: -120},
                    'count': {0: 7470, 1: 1262, 2: 171, 3: 86}})
pandas.DataFrame.compare is useful for a side-by-side comparison of each column, but it does not work for data frames with different rows:
df1.compare(df2, keep_shape=True, keep_equal=True)
ValueError: Can only compare identically-labeled DataFrame objects
Can we achieve the same functionality using pandas.merge?
I tried the below, but it does not give a side-by-side comparison of the corresponding columns:
pd.merge(df1,df2, on=['key_1','key_2'], suffixes=['_df1','_df2'], how='outer')
key_1 key_2 min_df1 count_df1 min_df2 count_df2
0 F F -158.0 58.0 -181.0 171.0
1 H G -881.0 24.0 NaN NaN
2 E E -674.0 13.0 NaN NaN
3 C C NaN NaN -452.0 7470.0
4 L D NaN NaN -153.0 1262.0
5 K K NaN NaN -120.0 86.0
Use concat after converting ['key_1','key_2'] to a MultiIndex:
df = (pd.concat([df1.set_index(['key_1','key_2']),
df2.set_index(['key_1','key_2'])], keys=['df1','df2'], axis=1)
.sort_index(level=1, axis=1))
print (df)
df1 df2 df1 df2
count count min min
key_1 key_2
C C NaN 7470.0 NaN -452.0
E E 13.0 NaN -674.0 NaN
F F 58.0 171.0 -158.0 -181.0
H G 24.0 NaN -881.0 NaN
K K NaN 86.0 NaN -120.0
L D NaN 1262.0 NaN -153.0
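A useful side effect (a sketch): because concat has now aligned both frames on the same MultiIndex, the two slices are identically labeled and the original compare call works again:
out = df['df1'].compare(df['df2'], keep_shape=True, keep_equal=True)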
After the merge, you can re-order the columns alphabetically to have them side by side:
first_columns = ['key_1','key_2']
merged_df = pd.merge(df1,df2, on=['key_1','key_2'], suffixes=['_df1','_df2'], how='outer')
merged_df = merged_df[first_columns + sorted([col for col in merged_df.columns if col not in first_columns ])]
One way:
merged_df = pd.merge(df1, df2, on=['key_1', 'key_2'],
                     suffixes=['_df1', '_df2'], how='outer').set_index(['key_1', 'key_2'])
# split 'min_df1' etc. into a ('min', 'df1') MultiIndex
merged_df.columns = merged_df.columns.str.split('_', expand=True)
# sort by metric name so each _df1/_df2 pair sits side by side
merged_df = merged_df.sort_index(level=0, axis=1)
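If you instead want the source (df1/df2) as the top column level, one optional extra step (a sketch) is:
merged_df = merged_df.swaplevel(axis=1).sort_index(level=0, axis=1)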

Increasing a value during merge in pandas

I have 2 dataframes
df1
product_id value name
abc 10 a
def 20 b
ggg 10 c
df2
Which I get after using df2.groupby(['prod_id'])['code'].count().reset_index()
prod_id code
abc 10
def 20
ggg 10
ooo 5
hhh 1
I want to merge values from df2 to df1 left on product_id, right on prod_id.
To get:
product_id value name
abc 20 a
def 40 b
ggg 20 c
I tried:
pd.merge(df1, df2.groupby(['prod_id'])['code'].count().reset_index(),
left_on='product_id', right_on='prod_id', how='left')
Which returns df1 with 2 additional columns, prod_id and code, where code holds the amount by which I would like to increase value. I could now just add the two columns together and drop the extras, but I would like to avoid that.
Here’s one alternative:
df1['value'] = df1.product_id.map(dict(df2.values)).fillna(0).add(df1.value)
Complete example:
df1 = pd.DataFrame({'product_id': {0: 'abc', 1: 'def', 2: 'ggg'},
                    'value': {0: 10, 1: 20, 2: 10},
                    'name': {0: 'a', 1: 'b', 2: 'c'}})
df2 = pd.DataFrame({'prod_id': {0: 'abc', 1: 'def', 2: 'ggg', 3: 'ooo', 4: 'hhh'},
                    'code': {0: 10, 1: 20, 2: 10, 3: 5, 4: 1}})
df1['value'] = df1.product_id.map(dict(df2.values)).fillna(0).add(df1.value)
OUTPUT:
product_id value name
0 abc 20 a
1 def 40 b
2 ggg 20 c
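Note that dict(df2.values) only works because df2 has exactly those two columns in that order; a more explicit, equivalent mapping (a sketch) is:
mapping = df2.set_index('prod_id')['code']
df1['value'] = df1['product_id'].map(mapping).fillna(0).add(df1['value'])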
You could use reindex on df2 in the order of df1's product_id, after the groupby count (without the reset_index), like:
df1['value'] += (
    df2.groupby(['prod_id'])
       ['code'].count()
       .reindex(df1['product_id'], fill_value=0)
       .to_numpy()  # drop the index so the addition aligns by position, not by label
)

Python: Rows to Column Conversion in pandas

I need help converting row answers to columns in Python. Given below is a sample dataset.
Thanks for the help.
ID | date       | question_id | Choice_id | answer
1  | 2020-01-01 | 471362125   | NAN       | 29720950
2  | 2020-01-01 | 471362121   | 311470023 | 8
3  | 2020-01-01 | 471362120   | 311470024 | 9
4  | 2020-01-01 | 471362524   | 312472025 | 5
5  | 2020-01-01 | 471362122   | NAN.      | Delivery Issue
Expected output
id|date|471362125_nan|471362121_311470023|471362120_311470024|471362524_312472025|471362122_NAN
1 | 2020-01-01| 29720950|8|9|5|Delivery Issue
I will rename these with the question text using rename in pandas.
You could do it with a brute-force technique: a lot of iloc, setting the column names, appending, and resetting the index. The main idea is that the column names and the first row mainly come from two columns, so you stitch those together horizontally:
input:
import pandas as pd
df = pd.DataFrame({'ID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
                   'date': {0: '2020-01-01', 1: '2020-01-01', 2: '2020-01-01',
                            3: '2020-01-01', 4: '2020-01-01'},
                   'question_id': {0: 471362125, 1: 471362121, 2: 471362120,
                                   3: 471362524, 4: 471362122},
                   'Choice_id': {0: 'NAN', 1: '311470023', 2: '311470024',
                                 3: '312472025', 4: 'NAN.'},
                   'answer': {0: '29720950', 1: '8', 2: '9', 3: '5', 4: 'Delivery Issue'}})
code:
df1 = df.copy()
d = df1['date'].min()
i = df1['ID'].min()
# build the new column labels from question_id and Choice_id
df1.columns = (df1['question_id'].astype(str) + '_' + df1['Choice_id'].astype(str)).to_list()
b = df1.columns.to_list()
# turn the 'answer' column into a single row
a = pd.DataFrame(df1.iloc[:, 4]).T
a.columns = b
# keep only that row (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df2 = pd.concat([a, df1]).iloc[0:1]
df2 = df2.reset_index()
df2 = df2.rename({'index': 'date'}, axis=1).reset_index()
df2 = df2.rename({'index': 'id'}, axis=1)
df2['date'] = d
df2['id'] = i
df2.columns.names = ['']
df2
output:
id date 471362125_NAN 471362121_311470023 471362120_311470024 471362524_312472025 471362122_NAN.
0 1 2020-01-01 29720950 8 9 5 Delivery Issue
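For reference, a pivot-based sketch that avoids the manual transpose, assuming one answer per question/choice pair per date (the ID column is dropped here):
df['qc'] = df['question_id'].astype(str) + '_' + df['Choice_id'].astype(str)
out = (df.pivot(index='date', columns='qc', values='answer')
         .reset_index()
         .rename_axis(None, axis=1))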

How to turn pandas data frame w/ row per unique term & date into df which contains 1 row per date, w/ unique terms + their values as columns?

I have a csv, or data frame, that looks something along the lines of this, but includes several hundred thousand rows:
df = {'Date': {0: '2014-01-01',
1: '2014-01-01',
2: '2014-01-01',
3: '2014-01-02',
4: '2014-01-02'},
'Name': {0: 'John',
1: 'John',
2: 'Rob',
3: 'Mel',
4: 'Rob'},
'Rank': {0: 1, 1: 3, 2: 2, 3: 5, 4: 6},
'Count': {0: 10, 1: 3, 2: 9, 3: 11, 4: 4}}
The names recur for each of the dates, but change in count and ranking. Instead of having one row per date for every single one of these names, as I do now, I'd like to arrange my data frame so that there is a value for every date. That is to say, I'd like my table to look like this:
Date John_Rank Rob_Rank Mel_rank John_count Mel_count Rob_count
2014-01-01 ... ... ... ... ...
2014-01-02 ... ... ... ... ...
I'd like to use this format to calculate the differences in ranks. I've come up against this several times before, but haven't had this many rows to deal with for a long stretch of dates — I've only done this manually up until now. Any advice would be very much appreciated!!
I think you can use pivot_table with default aggfunc='mean':
import pandas as pd
d = {'Date': {0: '2014-01-01',
1: '2014-01-01',
2: '2014-01-01',
3: '2014-01-02',
4: '2014-01-02'},
'Name': {0: 'John',
1: 'John',
2: 'Rob',
3: 'Mel',
4: 'Rob'},
'Rank': {0: 1, 1: 3, 2: 2, 3: 5, 4: 6},
'Count': {0: 10, 1: 3, 2: 9, 3: 11, 4: 4}}
df = pd.DataFrame(d)
print(df)
Count Date Name Rank
0 10 2014-01-01 John 1
1 3 2014-01-01 John 3
2 9 2014-01-01 Rob 2
3 11 2014-01-02 Mel 5
4 4 2014-01-02 Rob 6
df = pd.pivot_table(df, index='Date', columns='Name')
df.columns = ['_'.join(col).strip() for col in df.columns.values]
print(df)
Count_John Count_Mel Count_Rob Rank_John Rank_Mel Rank_Rob
Date
2014-01-01 6.5 NaN 9 2 NaN 2
2014-01-02 NaN 11 4 NaN 5 6
Or if you want to swap the MultiIndex levels in the columns:
df = pd.pivot_table(df, index='Date', columns='Name')
df.columns = df.columns.swaplevel(0,1)
df.columns = ['_'.join(col).strip() for col in df.columns.values]
print(df)
John_Count Mel_Count Rob_Count John_Rank Mel_Rank Rob_Rank
Date
2014-01-01 6.5 NaN 9 2 NaN 2
2014-01-02 NaN 11 4 NaN 5 6
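With the data in this wide format, the rank differences the question asks about become a one-liner. A sketch assuming the Name_Rank layout from the swaplevel variant above:
rank_cols = [c for c in df.columns if c.endswith('_Rank')]
rank_diff = df[rank_cols].diff()  # day-over-day change in each name's rank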
