I have two dataframes:
{'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}}
{'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}}
For every 'id' that exists in both dataframes I would like to compute the average of their values in 'total' and have that in a new dataframe.
I tried:
pd.merge(df1, df2, on="id")
with the hope that I could then do:
merged_df[['total']].mean(axis=1)
but it doesn't work: after the merge the overlapping columns are suffixed to total_x and total_y, so there is no 'total' column left to select.
How can you do this?
You could use:
df1.merge(df2, on='id').set_index('id').mean(axis=1).reset_index(name='total')
Or, if you have many columns, a more generic approach:
(df1.merge(df2, on='id', suffixes=(None, '_other')).set_index('id')
.rename(columns=lambda x: x.removesuffix('_other')) # requires python 3.9+
.groupby(axis=1, level=0)
.mean().reset_index()
)
Output:
id total
0 1548638 17.0
1 1953603 38.0
2 1956216 59.0
3 1962245 40.0
4 1981386 40.0
5 1981773 40.0
6 2004787 80.0
7 2017418 44.0
8 2020989 51.0
9 2045043 46.0
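A note on the generic approach above: `groupby(axis=1)` is deprecated in pandas 2.1+. Assuming a recent pandas, an equivalent (shown here on made-up toy frames, not the question's data) is to transpose, group on the index labels, and transpose back:

```python
import pandas as pd

# toy frames with an extra shared column besides 'total'
df1 = pd.DataFrame({'id': [1, 2], 'total': [10, 30], 'extra': [1, 3]})
df2 = pd.DataFrame({'id': [1, 2], 'total': [20, 50], 'extra': [5, 7]})

out = (df1.merge(df2, on='id', suffixes=(None, '_other'))
          .set_index('id')
          .rename(columns=lambda x: x.removesuffix('_other'))
          .T.groupby(level=0).mean()  # column-wise mean without axis=1
          .T.reset_index())
```

The transpose turns the duplicated column labels into duplicated index labels, which a plain `groupby(level=0)` can average.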
You can do it like below:
df1 = pd.DataFrame({'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}})
df2 = pd.DataFrame({'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}})
merged_df = df1.merge(df2, on='id')
merged_df['total_mean'] = merged_df.filter(regex='total').mean(axis=1)
print(merged_df)
Output:
id total_x total_y total_mean
0 1548638 17 17 17.0
1 1953603 38 38 38.0
2 1956216 59 59 59.0
3 1962245 40 40 40.0
4 1981386 40 40 40.0
5 1981773 40 40 40.0
6 2004787 80 80 80.0
7 2017418 44 44 44.0
8 2020989 51 51 51.0
9 2045043 46 46 46.0
I have the below model, for which I seek estimation of parameters using pymc3:
import pandas as pd
import pymc3 as pm
import arviz as arviz
myData = pd.DataFrame.from_dict({
'Unnamed: 0': {
0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10,
10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20,
20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29, 29: 30,
30: 31, 31: 32, 32: 33, 33: 34, 34: 35, 35: 36, 36: 37, 37: 38},
'y': {
0: 0.0079235409492941, 1: 0.0086530073429249, 2: 0.0297400780486734, 3: 0.0196358416326437, 4: 0.0023902064076204, 5: 0.0258055591736283, 6: 0.17394835142698, 7: 0.156463554455613, 8: 0.329388185725557, 9: 0.0076443508881763,
10: 0.0162081480398152, 11: 0.0, 12: 0.0015759139941696, 13: 0.420025972703085, 14: 0.0001226236519444, 15: 0.133061480234834, 16: 0.565454216154227, 17: 0.0002819734812997, 18: 0.000559715156383, 19: 0.0270686389659072,
20: 0.918300537689865, 21: 7.8262468302e-06, 22: 0.0073241434191945, 23: 0.0, 24: 0.0, 25: 0.0, 26: 0.0, 27: 0.0, 28: 0.0, 29: 0.0,
30: 0.174071274611405, 31: 0.0432109713717948, 32: 0.0544400838264943, 33: 0.0, 34: 0.0907049925221286, 35: 0.616680102647887, 36: 0.0, 37: 0.0},
'x': {
0: 23.8187587698947, 1: 15.9991138359515, 2: 33.6495930512881, 3: 28.555818797764, 4: -52.2967967248258, 5: -91.3835208788233, 6: -73.9830692708321, 7: -5.16901145289629, 8: 29.8363012310241, 9: 10.6820057903939,
10: 19.4868517164395, 11: 15.4499668436458, 12: -17.0441644773509, 13: 10.7025053739577, 14: -8.6382953428539, 15: -32.8892974839165, 16: -15.8671863161348, 17: -11.237248036145, 18: -7.37978020066205, 19: -3.33500586334862,
20: -4.02629933182873, 21: -20.2413384726948, 22: -54.9094885578775, 23: -48.041459120976, 24: -52.3125732905322, 25: -35.6269065970458, 26: -62.0296155423529, 27: -49.0825017152659, 28: -73.0574478287598, 29: -50.9409090127938,
30: -63.4650928035253, 31: -55.1263264283842, 32: -52.2841103768755, 33: -61.2275334149805, 34: -74.2175990067417, 35: -68.2961107804698, 36: -76.6834643609286, 37: -70.16769103228}
})
with pm.Model() as myModel:
    beta0 = pm.Normal('intercept', 0, 1)
    beta1 = pm.Normal('x', 0, 1)
    mu = beta0 + beta1 * myData['x'].values
    pm.Bernoulli('obs', p = pm.invlogit(mu), observed = myData['y'].values)

with myModel:
    calc = pm.sample(50000, tune = 10000, step = pm.Metropolis(), random_seed = 1000)

arviz.summary(calc, round_to = 10)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
intercept -2.537501 0.599667 -3.707061 -1.450243 0.004375 0.003118 18893.344191 22631.772985 1.000070
x 0.033750 0.024314 -0.007871 0.081619 0.000181 0.000133 18550.620475 20113.739639 1.000194
Now I changed the above model to this:
mu = beta0 + beta1 * myData['x'].values * 0
With this change I get the result below:
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
intercept -2.690874 0.546570 -3.698465 -1.643091 0.003611 0.002565 22980.471424 24806.935727 1.000036
x -0.013861 1.003612 -1.916176 1.826709 0.006874 0.005175 21336.662537 23299.680306 1.000084
I wonder if the above estimate is correct. Shouldn't I expect a very small estimate for the coefficient beta1? I see hardly any change in this estimate apart from the sign.
Any pointer is highly appreciated.
"hardly any change for this estimate"
Seems like you are ignoring the sd, which has a strong change and is behaving as expected. That is, the first version yields 0.034 ± 0.024 (weakly positive); whereas the second correctly reverts to the prior with -0.014 ± 1.00.
Looking at the input data, none of this seems surprising.
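To see the prior-reversion effect outside MCMC, here is a rough grid-approximation sketch (my own toy data and a fixed intercept, not the question's model): once the predictor is zeroed out, the likelihood no longer depends on beta1, so its marginal posterior collapses back to the N(0, 1) prior, i.e. sd ≈ 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
p_true = 1 / (1 + np.exp(-(-2.5 + 0.03 * x)))
y = (rng.random(200) < p_true).astype(float)

def posterior_sd_beta1(x, y, grid=np.linspace(-4, 4, 801)):
    """Posterior sd of beta1 in logit(p) = beta0 + beta1*x with an
    N(0, 1) prior; beta0 is held fixed at -2.5 to keep the sketch 1-D."""
    logpost = np.empty_like(grid)
    for i, b1 in enumerate(grid):
        p = 1 / (1 + np.exp(-(-2.5 + b1 * x)))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        logpost[i] = loglik - 0.5 * b1**2  # Bernoulli log-lik + log N(0, 1) prior
    w = np.exp(logpost - logpost.max())
    w /= w.sum()
    m = (grid * w).sum()
    return np.sqrt(((grid - m) ** 2 * w).sum())

sd_informed = posterior_sd_beta1(x, y)      # data constrains beta1
sd_zeroed = posterior_sd_beta1(x * 0, y)    # reverts to the prior, sd near 1
```

The informed sd is much smaller than 1 because the likelihood is curved in beta1; the zeroed version is flat in beta1, so only the prior remains.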
I am trying to sequence an already-sequenced dataset based on conditions. The dataframe looks like the following (the intended_cycle column is my desired output):
    id activity  duration_sec  cycle  intended_cycle
1    1  start             0.7      1               1
2    1  a                 0.3      1               1
3    1  b                 0.4      1               1
4    1  c                 0.5      1               1
5    1  c                 0.5      1               2
6    1  d                 0.4      1               2
7    1  e                 0.6      1               3
8    1  stop              2.0      1               3
9    1  start             0.1      2               4
10   1  b                 0.3      2               4
11   1  stop              0.2      2               4
12   1  f                 0.3      3               5
13   2  stop             40.0      4               6
14   2  start             3.0      5               7
15   2  a                 0.7      5               7
16   2  a                 3.0      5               7
17   2  b                 0.2      5               7
18   2  stop              0.2      5               7
19   2  start             0.1      6               8
20   2  f                 0.4      6               8
21   2  g                 0.2      6               8
22   2  h                 0.5      6               8
23   2  h                 6.0      6               8
24   2  stop              9.0      6               8
25   2  start             0.2      7               9
26   2  e                 0.3      7               9
27   2  f                 0.4      7              10
28   2  stop              0.7      7              10
The letters represent activity names. I'm hoping to sequence as per the intended_cycle column, based on the following conditions within the current sequence:
- the first activity with duration > 0.5 s after the "start" is considered as one sequence (or a sub-sequence, if you will)
- duplicate activities occurring consecutively are considered as one if either value is > 0.5 s
- the next activity > 0.5 s after the first one > 0.5 s is considered as the next sequence
- if there is no further activity > 0.5 s, the sequence is considered to run until the cycle column value changes
- if no activity within the current cycle is > 0.5 s, the activities are considered independent (i.e. rows 25-28)
I'd really appreciate any help. Big thanks to there being a community for this!
Sample code for df:
data = {'Unnamed: 0': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28},
'id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2, 20: 2, 21: 2, 22: 2, 23: 2, 24: 2, 25: 2, 26: 2, 27: 2},
'activity': {0: 'start', 1: 'a', 2: 'b', 3: 'c', 4: 'c', 5: 'd', 6: 'e', 7: 'stop', 8: 'start', 9: 'b', 10: 'stop', 11: 'f', 12: 'stop', 13: 'start', 14: 'a', 15: 'a', 16: 'b', 17: 'stop', 18: 'start', 19: 'f', 20: 'g', 21: 'h', 22: 'h', 23: 'stop', 24: 'start', 25: 'e', 26: 'f', 27: 'stop'},
'duration_sec': {0: 0.7, 1: 0.3, 2: 0.4, 3: 0.5, 4: 0.5, 5: 0.4, 6: 0.6, 7: 2.0, 8: 0.1, 9: 0.3, 10: 0.2, 11: 0.3, 12: 40.0, 13: 3.0, 14: 0.7, 15: 3.0, 16: 0.2, 17: 0.2, 18: 0.1, 19: 0.4, 20: 0.2, 21: 0.5, 22: 6.0, 23: 9.0, 24: 0.2, 25: 0.3, 26: 0.4, 27: 0.7}, 'cycle': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 3, 12: 4, 13: 5, 14: 5, 15: 5, 16: 5, 17: 5, 18: 6, 19: 6, 20: 6, 21: 6, 22: 6, 23: 6, 24: 7, 25: 7, 26: 7, 27: 7}, 'intended_cycle': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4, 10: 4, 11: 5, 12: 6, 13: 7, 14: 7, 15: 7, 16: 7, 17: 7, 18: 8, 19: 8, 20: 8, 21: 8, 22: 8, 23: 8, 24: 9, 25: 9, 26: 10, 27: 10}}
df = pd.DataFrame.from_dict(data)
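Not a full answer, but the first few rules can be sketched as follows (my own helper, and it deliberately ignores the per-cycle reset and the "no long activity" fallback, which would need extra handling, e.g. inside a groupby('cycle')):

```python
import pandas as pd

def label_subsequences(df, thresh=0.5):
    # collapse runs of consecutive identical activities into one event
    event = (df['activity'] != df['activity'].shift()).cumsum()
    # an event counts as "long" if any of its rows exceeds the threshold
    long_event = df.groupby(event)['duration_sec'].transform('max') > thresh
    # a sub-sequence boundary falls after the last row of each long event
    boundary = long_event & (event != event.shift(-1))
    return boundary.shift(fill_value=False).cumsum() + 1

toy = pd.DataFrame({'activity': ['start', 'a', 'a', 'b'],
                    'duration_sec': [0.7, 0.3, 0.6, 0.2]})
toy['seq'] = label_subsequences(toy)  # [1, 2, 2, 3]
```

The duplicate 'a' rows are merged into one event because one of them exceeds 0.5 s, and each long event closes a sub-sequence.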
nn_idx_df contains index values that match the index of xyz_df. How can I get the values from column H in xyz_df and create new columns in nn_idx_df to match the result illustrated in output_df? I could hack my way through this, but would like to see a pandorable solution.
nn_idx_df = pd.DataFrame({'nn_1_idx': {0: 65, 1: 7, 2: 18},
'nn_2_idx': {0: 64, 1: 9, 2: 64},
'nn_3_idx': {0: 69, 1: 67, 2: 68},
'nn_4_idx': {0: 75, 1: 13, 2: 65},
'nn_5_idx': {0: 70, 1: 66, 2: 1}})
print(nn_idx_df)
nn_1_idx nn_2_idx nn_3_idx nn_4_idx nn_5_idx
0 65 64 69 75 70
1 7 9 67 13 66
2 18 64 68 65 1
xyz_df = pd.DataFrame({'X': {1: 6401652.35,
7: 6401845.46,
9: 6401671.93,
13: 6401868.98,
18: 6401889.78,
64: 6401725.71,
65: 6401663.04,
66: 6401655.89,
67: 6401726.33,
68: 6401755.92,
69: 6401755.23,
70: 6401766.23,
75: 6401825.9},
'Y': {1: 1858548.15,
7: 1858375.68,
9: 1858490.83,
13: 1858403.79,
18: 1858423.25,
64: 1858579.25,
65: 1858570.3,
66: 1858569.97,
67: 1858607.8,
68: 1858581.58,
69: 1858591.46,
70: 1858517.48,
75: 1858420.72},
'Z': {1: 467.62,
7: 482.22,
9: 459.15,
13: 485.17,
18: 488.35,
64: 488.88,
65: 465.75,
66: 467.35,
67: 486.12,
68: 490.12,
69: 490.68,
70: 483.96,
75: 467.39},
'H': {1: 47.8791,
7: 45.5502,
9: 46.0995,
13: 41.9554,
18: 41.0537,
64: 47.1215,
65: 46.0047,
66: 45.936,
67: 40.5807,
68: 37.8478,
69: 37.1639,
70: 37.2314,
75: 25.8446}})
print(xyz_df)
X Y Z H
1 6401652.35 1858548.15 467.62 47.8791
7 6401845.46 1858375.68 482.22 45.5502
9 6401671.93 1858490.83 459.15 46.0995
13 6401868.98 1858403.79 485.17 41.9554
18 6401889.78 1858423.25 488.35 41.0537
64 6401725.71 1858579.25 488.88 47.1215
65 6401663.04 1858570.30 465.75 46.0047
66 6401655.89 1858569.97 467.35 45.9360
67 6401726.33 1858607.80 486.12 40.5807
68 6401755.92 1858581.58 490.12 37.8478
69 6401755.23 1858591.46 490.68 37.1639
70 6401766.23 1858517.48 483.96 37.2314
75 6401825.90 1858420.72 467.39 25.8446
output_df = pd.DataFrame(
{'nn_1_idx': {0: 65, 1: 7, 2: 18},
'nn_2_idx': {0: 64, 1: 9, 2: 64},
'nn_3_idx': {0: 69, 1: 67, 2: 68},
'nn_4_idx': {0: 75, 1: 13, 2: 65},
'nn_5_idx': {0: 70, 1: 66, 2: 1},
'nn_1_idx_h': {0: 46.0047, 1: 45.5502, 2: 41.0537},
'nn_2_idx_h': {0: 47.1215, 1: 46.0995, 2: 47.1215},
'nn_3_idx_h': {0: 37.1639, 1:40.5807, 2: 37.8478},
'nn_4_idx_h': {0: 25.8446, 1: 41.9554, 2: 46.0047},
'nn_5_idx_h': {0: 37.2314, 1: 45.9360, 2: 47.8791}})
print(output_df)
nn_1_idx nn_2_idx nn_3_idx nn_4_idx nn_5_idx nn_1_idx_h nn_2_idx_h nn_3_idx_h nn_4_idx_h nn_5_idx_h
0 65 64 69 75 70 46.0047 47.1215 37.1639 25.8446 37.2314
1 7 9 67 13 66 45.5502 46.0995 40.5807 41.9554 45.9360
2 18 64 68 65 1 41.0537 47.1215 37.8478 46.0047 47.8791
Let us do replace with join:
df=nn_idx_df.join(nn_idx_df.replace(xyz_df.H).add_suffix('_h'))
df
nn_1_idx nn_2_idx nn_3_idx ... nn_3_idx_h nn_4_idx_h nn_5_idx_h
0 65 64 69 ... 37.1639 25.8446 37.2314
1 7 9 67 ... 40.5807 41.9554 45.9360
2 18 64 68 ... 37.8478 46.0047 47.8791
[3 rows x 10 columns]
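One caveat with replace: an id that has no match in H's index is silently left as-is, still looking like a height. A per-column map lookup turns unmatched ids into NaN instead, which is easier to spot (minimal toy data here, not the full frames):

```python
import pandas as pd

nn_idx_df = pd.DataFrame({'nn_1_idx': [65, 7], 'nn_2_idx': [64, 9]})
h = pd.Series({7: 45.5502, 9: 46.0995, 64: 47.1215, 65: 46.0047}, name='H')

# look each index up in H column by column; unmatched ids become NaN
out = nn_idx_df.join(nn_idx_df.apply(lambda col: col.map(h)).add_suffix('_h'))
```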
+------+------+------+------+------+------+------+------+
|      |      |      |      | USD  | EUR  | JPY  | RUP  |
+------+------+------+------+------+------+------+------+
|      |      |      |      | Case | Cons | Case | Case |
+------+------+------+------+------+------+------+------+
|      |      |      |      | High | Low  | CWM  | AEP  |
+------+------+------+------+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 | Owner| OPS  | VH   | Delta|
+------+------+------+------+------+------+------+------+
| V1   | V2   | V3   | V4   | V5   | V6   | V7   | V8   |
| V1a  | V2a  | V3a  | V4a  | V5a  | V6a  | V7a  | V8a  |
+------+------+------+------+------+------+------+------+
As requested, here is the sample data as output by df.to_dict():
{('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'Unnamed: 0_level_2', 'Year'): {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020, 5: 2020, 6: 2020, 7: 2020, 8: 2020, 9: 2020, 10: 2020, 11: 2020, 12: 2020, 13: 2020, 14: 2020, 15: 2020, 16: 2020, 17: 2020, 18: 2020, 19: 2020, 20: 2020, 21: 2020, 22: 2020, 23: 2020, 24: 2020, 25: 2020, 26: 2020, 27: 2020, 28: 2020, 29: 2020, 30: 2020, 31: 2020, 32: 2020, 33: 2020, 34: 2020, 35: 2020, 36: 2020, 37: 2020, 38: 2020, 39: 2020, 40: 2020, 41: 2020, 42: 2020, 43: 2020, 44: 2020, 45: 2020, 46: 2020, 47: 2020}, ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'Unnamed: 1_level_2', 'Month'): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1, 32: 1, 33: 1, 34: 1, 35: 1, 36: 1, 37: 1, 38: 1, 39: 1, 40: 1, 41: 1, 42: 1, 43: 1, 44: 1, 45: 1, 46: 1, 47: 1}, ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'Unnamed: 2_level_2', 'Day'): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, 23: 1, 24: 2, 25: 2, 26: 2, 27: 2, 28: 2, 29: 2, 30: 2, 31: 2, 32: 2, 33: 2, 34: 2, 35: 2, 36: 2, 37: 2, 38: 2, 39: 2, 40: 2, 41: 2, 42: 2, 43: 2, 44: 2, 45: 2, 46: 2, 47: 2}, ('Unnamed: 3_level_0', 'Unnamed: 3_level_1', 'Unnamed: 3_level_2', 'Hour'): {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 0, 25: 1, 26: 2, 27: 3, 28: 4, 29: 5, 30: 6, 31: 7, 32: 8, 33: 9, 34: 10, 35: 11, 36: 12, 37: 13, 38: 14, 39: 15, 40: 16, 41: 17, 42: 18, 43: 19, 44: 20, 45: 21, 46: 22, 47: 23}, ('USD', 'Cons', 'very high', 'Hub1'): {0: 23.06, 1: 21.49, 2: 21.73, 3: 21.58, 4: 21.67, 5: 22.78, 6: 27.15, 7: 26.09, 8: 26.23, 9: 28.21, 10: 29.21, 11: 31.97, 12: 30.45, 13: 30.45, 14: 30.45, 15: 29.14, 16: 
28.28, 17: 26.35, 18: 26.32, 19: 27.01, 20: 26.34, 21: 28.22, 22: 27.77, 23: 26.94, 24: 24.16, 25: 22.74, 26: 22.67, 27: 22.67, 28: 22.74, 29: 23.14, 30: 27.81, 31: 27.87, 32: 28.05, 33: 27.91, 34: 32.66, 35: 35.14, 36: 33.32, 37: 36.17, 38: 38.33, 39: 31.75, 40: 30.9, 41: 26.36, 42: 27.17, 43: 28.17, 44: 26.17, 45: 26.5, 46: 28.95, 47: 26.94}, ('EUR', 'Case', 'CWM', 'Hub2'): {0: 18.59, 1: 18.32, 2: 18.32, 3: 18.32, 4: 18.32, 5: 19.19, 6: 22.57, 7: 25.38, 8: 25.53, 9: 25.9, 10: 26.47, 11: 26.47, 12: 26.09, 13: 25.59, 14: 25.35, 15: 24.97, 16: 24.22, 17: 25.22, 18: 25.49, 19: 26.19, 20: 25.63, 21: 25.1, 22: 21.93, 23: 19.61, 24: 19.4, 25: 18.75, 26: 18.85, 27: 18.75, 28: 18.88, 29: 19.41, 30: 23.97, 31: 27.07, 32: 27.23, 33: 29.21, 34: 30.49, 35: 28.52, 36: 27.49, 37: 26.93, 38: 26.71, 39: 25.76, 40: 25.24, 41: 25.67, 42: 26.72, 43: 27.98, 44: 26.73, 45: 25.97, 46: 22.34, 47: 19.47}, ('USD', 'Cons', 'Ventyx', 'Hub3'): {0: 19.78, 1: 20.96, 2: 21.58, 3: 21.5, 4: 21.27, 5: 22.59, 6: 26.22, 7: 26.78, 8: 26.78, 9: 26.97, 10: 26.97, 11: 26.97, 12: 26.53, 13: 26.34, 14: 26.5, 15: 26.22, 16: 25.6, 17: 26.5, 18: 26.74, 19: 27.44, 20: 26.87, 21: 26.5, 22: 23.2, 23: 23.58, 24: 22.74, 25: 22.31, 26: 22.27, 27: 22.27, 28: 22.74, 29: 22.84, 30: 27.79, 31: 31.63, 32: 29.6, 33: 29.25, 34: 30.53, 35: 28.51, 36: 27.48, 37: 26.97, 38: 26.74, 39: 26.53, 40: 26.5, 41: 26.92, 42: 28.89, 43: 30.24, 44: 28.38, 45: 27.38, 46: 24.39, 47: 23.2}}
That is about as good a representation as I can make of this file. Columns 1-4 have a single header; columns 5-N (yes N, because we don't know how many) have 4 headers.
The dataframe needs to look like this:
+------+------+------+------+------+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 | NCol1| NCol2|NCol3 | NCol4| Col9 |
+------+------+------+------+------+------+------+------+------+
| V1 | V2 | V3 | V4 | USD | Case | High | Owner| V5 |
| V1a | V2a | V3a | V4a | USD | Case | High | Owner| V5a |
| V1a | V2a | V3a | V4a | EUR | Cons | Low | Ops | V6 |
| V1a | V2a | V3a | V4a | EUR | Cons | Low | Ops | V6a |
| V1a | V2a | V3a | V4a | JPY | Case | CWM | VH | V7 |
| V1a | V2a | V3a | V4a | JPY | Case | CWM | VH | V7a |
| V1a | V2a | V3a | V4a | RUP | Case | AEP | Delta| V8 |
| V1a | V2a | V3a | V4a | RUP | Case | AEP | Delta| V8a |
+------+------+------+------+------+------+------+------+------+
So essentially, pivot the 5th through Nth column headers into new columns, so that each row of data is aligned with the first 4 columns and with the headers its values were originally under.
I tried:
df = pd.read_csv(file,header=[0,1,2,3])
df.melt(var_name=['a','b','c','d'], value_name='e')
Also:
df2 = df.melt(id_vars=['Year','Month','Day','Hour'], col_level=3)
And :
df2 = df.stack().stack().stack().stack()
That last one is very close, but it also stacks the first 4 columns. However, that doesn't work, as it gives me just col1 and col2.
I feel like I am shooting in the dark, but this is what I could pull out. Let me know if it is not what you want; if it isn't, kindly post a sample output based on the dict you posted so others can butt in, and I'll gladly delete this hack.
df = pd.DataFrame(sample)
# flatten the 4-level header into single underscore-joined names
df.columns = df.columns.to_flat_index()
df.columns = ['_'.join(i) for i in df.columns]
df = df.melt(id_vars=['Unnamed: 0_level_0_Unnamed: 0_level_1_Unnamed: 0_level_2_Year',
                      'Unnamed: 1_level_0_Unnamed: 1_level_1_Unnamed: 1_level_2_Month',
                      'Unnamed: 2_level_0_Unnamed: 2_level_1_Unnamed: 2_level_2_Day',
                      'Unnamed: 3_level_0_Unnamed: 3_level_1_Unnamed: 3_level_2_Hour'])
# keep only the last piece of each flattened name, then split the
# 'variable' column back into the four original header levels
df.columns = [i.split('_')[-1] for i in df.columns]
pd.concat([df, df.variable.str.split('_', expand=True)], axis=1)
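For what it's worth, an alternative sketch (on a toy frame of the same shape, just two header levels instead of four) avoids flattening names entirely: move the id columns into the index, then stack all the header levels at once.

```python
import pandas as pd

# toy stand-in: 2 id columns plus value columns under a 2-level header
cols = pd.MultiIndex.from_tuples([
    ('Unnamed: 0_level_0', 'Year'),
    ('Unnamed: 1_level_0', 'Month'),
    ('USD', 'Hub1'),
    ('EUR', 'Hub2'),
])
df = pd.DataFrame([[2020, 1, 23.06, 18.59],
                   [2020, 1, 21.49, 18.32]], columns=cols)

# set the id columns as the index, then stack both header levels into rows
long = (df.set_index(list(df.columns[:2]))
          .stack([0, 1])
          .reset_index())
long.columns = ['Year', 'Month', 'currency', 'hub', 'value']
```

With four header levels the call becomes `.stack([0, 1, 2, 3])` and the renaming gains two more columns; the shape of the result is the same.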