I have two dataframes:
{'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}}
{'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}}
For every 'id' that exists in both dataframes I would like to compute the average of their values in 'total' and have that in a new dataframe.
I tried:
pd.merge(df1, df2, on="id")
with the hope that I could then do:
merged_df[['total']].mean(axis=1)
but it doesn't work: after the merge the overlapping columns are suffixed to total_x and total_y, so there is no 'total' column left to select.
How can you do this?
You could use:
df1.merge(df2, on='id').set_index('id').mean(axis=1).reset_index(name='total')
Or, if you have many columns, a more generic approach:
(df1.merge(df2, on='id', suffixes=(None, '_other')).set_index('id')
.rename(columns=lambda x: x.removesuffix('_other')) # requires python 3.9+
.groupby(axis=1, level=0)
.mean().reset_index()
)
Output:
id total
0 1548638 17.0
1 1953603 38.0
2 1956216 59.0
3 1962245 40.0
4 1981386 40.0
5 1981773 40.0
6 2004787 80.0
7 2017418 44.0
8 2020989 51.0
9 2045043 46.0
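A note on the generic approach above: `groupby(axis=1)` is deprecated in pandas 2.1+. Assuming a recent pandas, an equivalent (shown here on made-up toy frames, not the question's data) is to transpose, group on the index labels, and transpose back:

```python
import pandas as pd

# toy frames with an extra shared column besides 'total'
df1 = pd.DataFrame({'id': [1, 2], 'total': [10, 30], 'extra': [1, 3]})
df2 = pd.DataFrame({'id': [1, 2], 'total': [20, 50], 'extra': [5, 7]})

out = (df1.merge(df2, on='id', suffixes=(None, '_other'))
          .set_index('id')
          .rename(columns=lambda x: x.removesuffix('_other'))
          .T.groupby(level=0).mean()  # column-wise mean without axis=1
          .T.reset_index())
```

The transpose turns the duplicated column labels into duplicated index labels, which a plain `groupby(level=0)` can average.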
You can do it like below:
df1 = pd.DataFrame({'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}})
df2 = pd.DataFrame({'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}})
merged_df = df1.merge(df2, on='id')
merged_df['total_mean'] = merged_df.filter(regex='total').mean(axis=1)
print(merged_df)
Output:
id total_x total_y total_mean
0 1548638 17 17 17.0
1 1953603 38 38 38.0
2 1956216 59 59 59.0
3 1962245 40 40 40.0
4 1981386 40 40 40.0
5 1981773 40 40 40.0
6 2004787 80 80 80.0
7 2017418 44 44 44.0
8 2020989 51 51 51.0
9 2045043 46 46 46.0
I have the below model, for which I seek estimation of parameters using pymc3:
import pandas as pd
import pymc3 as pm
import arviz as arviz
myData = pd.DataFrame.from_dict({
'Unnamed: 0': {
0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10,
10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20,
20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29, 29: 30,
30: 31, 31: 32, 32: 33, 33: 34, 34: 35, 35: 36, 36: 37, 37: 38},
'y': {
0: 0.0079235409492941, 1: 0.0086530073429249, 2: 0.0297400780486734, 3: 0.0196358416326437, 4: 0.0023902064076204, 5: 0.0258055591736283, 6: 0.17394835142698, 7: 0.156463554455613, 8: 0.329388185725557, 9: 0.0076443508881763,
10: 0.0162081480398152, 11: 0.0, 12: 0.0015759139941696, 13: 0.420025972703085, 14: 0.0001226236519444, 15: 0.133061480234834, 16: 0.565454216154227, 17: 0.0002819734812997, 18: 0.000559715156383, 19: 0.0270686389659072,
20: 0.918300537689865, 21: 7.8262468302e-06, 22: 0.0073241434191945, 23: 0.0, 24: 0.0, 25: 0.0, 26: 0.0, 27: 0.0, 28: 0.0, 29: 0.0,
30: 0.174071274611405, 31: 0.0432109713717948, 32: 0.0544400838264943, 33: 0.0, 34: 0.0907049925221286, 35: 0.616680102647887, 36: 0.0, 37: 0.0},
'x': {
0: 23.8187587698947, 1: 15.9991138359515, 2: 33.6495930512881, 3: 28.555818797764, 4: -52.2967967248258, 5: -91.3835208788233, 6: -73.9830692708321, 7: -5.16901145289629, 8: 29.8363012310241, 9: 10.6820057903939,
10: 19.4868517164395, 11: 15.4499668436458, 12: -17.0441644773509, 13: 10.7025053739577, 14: -8.6382953428539, 15: -32.8892974839165, 16: -15.8671863161348, 17: -11.237248036145, 18: -7.37978020066205, 19: -3.33500586334862,
20: -4.02629933182873, 21: -20.2413384726948, 22: -54.9094885578775, 23: -48.041459120976, 24: -52.3125732905322, 25: -35.6269065970458, 26: -62.0296155423529, 27: -49.0825017152659, 28: -73.0574478287598, 29: -50.9409090127938,
30: -63.4650928035253, 31: -55.1263264283842, 32: -52.2841103768755, 33: -61.2275334149805, 34: -74.2175990067417, 35: -68.2961107804698, 36: -76.6834643609286, 37: -70.16769103228}
})
with pm.Model() as myModel:
    beta0 = pm.Normal('intercept', 0, 1)
    beta1 = pm.Normal('x', 0, 1)
    mu = beta0 + beta1 * myData['x'].values
    pm.Bernoulli('obs', p = pm.invlogit(mu), observed = myData['y'].values)

with myModel:
    calc = pm.sample(50000, tune = 10000, step = pm.Metropolis(), random_seed = 1000)

arviz.summary(calc, round_to = 10)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
intercept -2.537501 0.599667 -3.707061 -1.450243 0.004375 0.003118 18893.344191 22631.772985 1.000070
x 0.033750 0.024314 -0.007871 0.081619 0.000181 0.000133 18550.620475 20113.739639 1.000194
Now I changed the above model to this:
mu = beta0 + beta1 * myData['x'].values * 0
With this change I get the result below:
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
intercept -2.690874 0.546570 -3.698465 -1.643091 0.003611 0.002565 22980.471424 24806.935727 1.000036
x -0.013861 1.003612 -1.916176 1.826709 0.006874 0.005175 21336.662537 23299.680306 1.000084
I wonder if the above estimate is correct. Shouldn't I expect a very small estimate for the coefficient beta1? I see hardly any change in this estimate apart from the sign.
Any pointer is highly appreciated.
"hardly any change for this estimate"
Seems like you are ignoring the sd, which has a strong change and is behaving as expected. That is, the first version yields 0.034 ± 0.024 (weakly positive); whereas the second correctly reverts to the prior with -0.014 ± 1.00.
Looking at the input data, none of this seems surprising.
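To see the prior-reversion effect outside MCMC, here is a rough grid-approximation sketch (my own toy data and a fixed intercept, not the question's model): once the predictor is zeroed out, the likelihood no longer depends on beta1, so its marginal posterior collapses back to the N(0, 1) prior, i.e. sd ≈ 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
p_true = 1 / (1 + np.exp(-(-2.5 + 0.03 * x)))
y = (rng.random(200) < p_true).astype(float)

def posterior_sd_beta1(x, y, grid=np.linspace(-4, 4, 801)):
    """Posterior sd of beta1 in logit(p) = beta0 + beta1*x with an
    N(0, 1) prior; beta0 is held fixed at -2.5 to keep the sketch 1-D."""
    logpost = np.empty_like(grid)
    for i, b1 in enumerate(grid):
        p = 1 / (1 + np.exp(-(-2.5 + b1 * x)))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        logpost[i] = loglik - 0.5 * b1**2  # Bernoulli log-lik + log N(0, 1) prior
    w = np.exp(logpost - logpost.max())
    w /= w.sum()
    m = (grid * w).sum()
    return np.sqrt(((grid - m) ** 2 * w).sum())

sd_informed = posterior_sd_beta1(x, y)      # data constrains beta1
sd_zeroed = posterior_sd_beta1(x * 0, y)    # reverts to the prior, sd near 1
```

The informed sd is much smaller than 1 because the likelihood is curved in beta1; the zeroed version is flat in beta1, so only the prior remains.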
I am trying to sequence an already-sequenced dataset based on conditions. The dataframe looks like the following (the intended_cycle column is my desired output):
    id activity  duration_sec  cycle  intended_cycle
1    1  start             0.7      1               1
2    1  a                 0.3      1               1
3    1  b                 0.4      1               1
4    1  c                 0.5      1               1
5    1  c                 0.5      1               2
6    1  d                 0.4      1               2
7    1  e                 0.6      1               3
8    1  stop              2.0      1               3
9    1  start             0.1      2               4
10   1  b                 0.3      2               4
11   1  stop              0.2      2               4
12   1  f                 0.3      3               5
13   2  stop             40.0      4               6
14   2  start             3.0      5               7
15   2  a                 0.7      5               7
16   2  a                 3.0      5               7
17   2  b                 0.2      5               7
18   2  stop              0.2      5               7
19   2  start             0.1      6               8
20   2  f                 0.4      6               8
21   2  g                 0.2      6               8
22   2  h                 0.5      6               8
23   2  h                 6.0      6               8
24   2  stop              9.0      6               8
25   2  start             0.2      7               9
26   2  e                 0.3      7               9
27   2  f                 0.4      7              10
28   2  stop              0.7      7              10
The letters represent activity names. I'm hoping to sequence as per the intended_cycle column, based on the following conditions within the current sequence:
- the first activity with duration > 0.5 s after the "start" is considered as one sequence (or a sub-sequence, if you will)
- duplicate activities occurring consecutively are considered as one if either value is > 0.5 s
- the next activity > 0.5 s after the first one > 0.5 s is considered as the next sequence
- if there is no further activity > 0.5 s, the sequence is considered to run until the cycle column value changes
- if no activity within the current cycle is > 0.5 s, the activities are considered independent (i.e. rows 25-28)
I'd really appreciate any help. Big thanks to there being a community for this!
Sample code for df:
data = {'Unnamed: 0': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28},
'id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2, 20: 2, 21: 2, 22: 2, 23: 2, 24: 2, 25: 2, 26: 2, 27: 2},
'activity': {0: 'start', 1: 'a', 2: 'b', 3: 'c', 4: 'c', 5: 'd', 6: 'e', 7: 'stop', 8: 'start', 9: 'b', 10: 'stop', 11: 'f', 12: 'stop', 13: 'start', 14: 'a', 15: 'a', 16: 'b', 17: 'stop', 18: 'start', 19: 'f', 20: 'g', 21: 'h', 22: 'h', 23: 'stop', 24: 'start', 25: 'e', 26: 'f', 27: 'stop'},
'duration_sec': {0: 0.7, 1: 0.3, 2: 0.4, 3: 0.5, 4: 0.5, 5: 0.4, 6: 0.6, 7: 2.0, 8: 0.1, 9: 0.3, 10: 0.2, 11: 0.3, 12: 40.0, 13: 3.0, 14: 0.7, 15: 3.0, 16: 0.2, 17: 0.2, 18: 0.1, 19: 0.4, 20: 0.2, 21: 0.5, 22: 6.0, 23: 9.0, 24: 0.2, 25: 0.3, 26: 0.4, 27: 0.7}, 'cycle': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 3, 12: 4, 13: 5, 14: 5, 15: 5, 16: 5, 17: 5, 18: 6, 19: 6, 20: 6, 21: 6, 22: 6, 23: 6, 24: 7, 25: 7, 26: 7, 27: 7}, 'intended_cycle': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4, 10: 4, 11: 5, 12: 6, 13: 7, 14: 7, 15: 7, 16: 7, 17: 7, 18: 8, 19: 8, 20: 8, 21: 8, 22: 8, 23: 8, 24: 9, 25: 9, 26: 10, 27: 10}}
df = pd.DataFrame.from_dict(data)
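Not a full answer, but the first few rules can be sketched as follows (my own helper, and it deliberately ignores the per-cycle reset and the "no long activity" fallback, which would need extra handling, e.g. inside a groupby('cycle')):

```python
import pandas as pd

def label_subsequences(df, thresh=0.5):
    # collapse runs of consecutive identical activities into one event
    event = (df['activity'] != df['activity'].shift()).cumsum()
    # an event counts as "long" if any of its rows exceeds the threshold
    long_event = df.groupby(event)['duration_sec'].transform('max') > thresh
    # a sub-sequence boundary falls after the last row of each long event
    boundary = long_event & (event != event.shift(-1))
    return boundary.shift(fill_value=False).cumsum() + 1

toy = pd.DataFrame({'activity': ['start', 'a', 'a', 'b'],
                    'duration_sec': [0.7, 0.3, 0.6, 0.2]})
toy['seq'] = label_subsequences(toy)  # [1, 2, 2, 3]
```

The duplicate 'a' rows are merged into one event because one of them exceeds 0.5 s, and each long event closes a sub-sequence.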
nn_idx_df contains index values that match the index of xyz_df. How can I get the values from column H in xyz_df and create new columns in nn_idx_df to match the result illustrated in output_df? I could hack my way through this, but would like to see a pandorable solution.
nn_idx_df = pd.DataFrame({'nn_1_idx': {0: 65, 1: 7, 2: 18},
'nn_2_idx': {0: 64, 1: 9, 2: 64},
'nn_3_idx': {0: 69, 1: 67, 2: 68},
'nn_4_idx': {0: 75, 1: 13, 2: 65},
'nn_5_idx': {0: 70, 1: 66, 2: 1}})
print(nn_idx_df)
nn_1_idx nn_2_idx nn_3_idx nn_4_idx nn_5_idx
0 65 64 69 75 70
1 7 9 67 13 66
2 18 64 68 65 1
xyz_df = pd.DataFrame({'X': {1: 6401652.35,
7: 6401845.46,
9: 6401671.93,
13: 6401868.98,
18: 6401889.78,
64: 6401725.71,
65: 6401663.04,
66: 6401655.89,
67: 6401726.33,
68: 6401755.92,
69: 6401755.23,
70: 6401766.23,
75: 6401825.9},
'Y': {1: 1858548.15,
7: 1858375.68,
9: 1858490.83,
13: 1858403.79,
18: 1858423.25,
64: 1858579.25,
65: 1858570.3,
66: 1858569.97,
67: 1858607.8,
68: 1858581.58,
69: 1858591.46,
70: 1858517.48,
75: 1858420.72},
'Z': {1: 467.62,
7: 482.22,
9: 459.15,
13: 485.17,
18: 488.35,
64: 488.88,
65: 465.75,
66: 467.35,
67: 486.12,
68: 490.12,
69: 490.68,
70: 483.96,
75: 467.39},
'H': {1: 47.8791,
7: 45.5502,
9: 46.0995,
13: 41.9554,
18: 41.0537,
64: 47.1215,
65: 46.0047,
66: 45.936,
67: 40.5807,
68: 37.8478,
69: 37.1639,
70: 37.2314,
75: 25.8446}})
print(xyz_df)
X Y Z H
1 6401652.35 1858548.15 467.62 47.8791
7 6401845.46 1858375.68 482.22 45.5502
9 6401671.93 1858490.83 459.15 46.0995
13 6401868.98 1858403.79 485.17 41.9554
18 6401889.78 1858423.25 488.35 41.0537
64 6401725.71 1858579.25 488.88 47.1215
65 6401663.04 1858570.30 465.75 46.0047
66 6401655.89 1858569.97 467.35 45.9360
67 6401726.33 1858607.80 486.12 40.5807
68 6401755.92 1858581.58 490.12 37.8478
69 6401755.23 1858591.46 490.68 37.1639
70 6401766.23 1858517.48 483.96 37.2314
75 6401825.90 1858420.72 467.39 25.8446
output_df = pd.DataFrame(
{'nn_1_idx': {0: 65, 1: 7, 2: 18},
'nn_2_idx': {0: 64, 1: 9, 2: 64},
'nn_3_idx': {0: 69, 1: 67, 2: 68},
'nn_4_idx': {0: 75, 1: 13, 2: 65},
'nn_5_idx': {0: 70, 1: 66, 2: 1},
'nn_1_idx_h': {0: 46.0047, 1: 45.5502, 2: 41.0537},
'nn_2_idx_h': {0: 47.1215, 1: 46.0995, 2: 47.1215},
'nn_3_idx_h': {0: 37.1639, 1:40.5807, 2: 37.8478},
'nn_4_idx_h': {0: 25.8446, 1: 41.9554, 2: 46.0047},
'nn_5_idx_h': {0: 37.2314, 1: 45.9360, 2: 47.8791}})
print(output_df)
nn_1_idx nn_2_idx nn_3_idx nn_4_idx nn_5_idx nn_1_idx_h nn_2_idx_h nn_3_idx_h nn_4_idx_h nn_5_idx_h
0 65 64 69 75 70 46.0047 47.1215 37.1639 25.8446 37.2314
1 7 9 67 13 66 45.5502 46.0995 40.5807 41.9554 45.9360
2 18 64 68 65 1 41.0537 47.1215 37.8478 46.0047 47.8791
Let us do replace with join:
df=nn_idx_df.join(nn_idx_df.replace(xyz_df.H).add_suffix('_h'))
df
nn_1_idx nn_2_idx nn_3_idx ... nn_3_idx_h nn_4_idx_h nn_5_idx_h
0 65 64 69 ... 37.1639 25.8446 37.2314
1 7 9 67 ... 40.5807 41.9554 45.9360
2 18 64 68 ... 37.8478 46.0047 47.8791
[3 rows x 10 columns]
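One caveat with replace: an id that has no match in H's index is silently left as-is, still looking like a height. A per-column map lookup turns unmatched ids into NaN instead, which is easier to spot (minimal toy data here, not the full frames):

```python
import pandas as pd

nn_idx_df = pd.DataFrame({'nn_1_idx': [65, 7], 'nn_2_idx': [64, 9]})
h = pd.Series({7: 45.5502, 9: 46.0995, 64: 47.1215, 65: 46.0047}, name='H')

# look each index up in H column by column; unmatched ids become NaN
out = nn_idx_df.join(nn_idx_df.apply(lambda col: col.map(h)).add_suffix('_h'))
```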
+------+------+------+------+------+------+------+------+
|      |      |      |      | USD  | EUR  | JPY  | RUP  |
+------+------+------+------+------+------+------+------+
|      |      |      |      | Case | Cons | Case | Case |
+------+------+------+------+------+------+------+------+
|      |      |      |      | High | Low  | CWM  | AEP  |
+------+------+------+------+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 | Owner| OPS  | VH   | Delta|
+------+------+------+------+------+------+------+------+
| V1   | V2   | V3   | V4   | V5   | V6   | V7   | V8   |
| V1a  | V2a  | V3a  | V4a  | V5a  | V6a  | V7a  | V8a  |
+------+------+------+------+------+------+------+------+
As requested, here is the sample data as output by df.to_dict():
{('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'Unnamed: 0_level_2', 'Year'): {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020, 5: 2020, 6: 2020, 7: 2020, 8: 2020, 9: 2020, 10: 2020, 11: 2020, 12: 2020, 13: 2020, 14: 2020, 15: 2020, 16: 2020, 17: 2020, 18: 2020, 19: 2020, 20: 2020, 21: 2020, 22: 2020, 23: 2020, 24: 2020, 25: 2020, 26: 2020, 27: 2020, 28: 2020, 29: 2020, 30: 2020, 31: 2020, 32: 2020, 33: 2020, 34: 2020, 35: 2020, 36: 2020, 37: 2020, 38: 2020, 39: 2020, 40: 2020, 41: 2020, 42: 2020, 43: 2020, 44: 2020, 45: 2020, 46: 2020, 47: 2020}, ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'Unnamed: 1_level_2', 'Month'): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1, 32: 1, 33: 1, 34: 1, 35: 1, 36: 1, 37: 1, 38: 1, 39: 1, 40: 1, 41: 1, 42: 1, 43: 1, 44: 1, 45: 1, 46: 1, 47: 1}, ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'Unnamed: 2_level_2', 'Day'): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, 23: 1, 24: 2, 25: 2, 26: 2, 27: 2, 28: 2, 29: 2, 30: 2, 31: 2, 32: 2, 33: 2, 34: 2, 35: 2, 36: 2, 37: 2, 38: 2, 39: 2, 40: 2, 41: 2, 42: 2, 43: 2, 44: 2, 45: 2, 46: 2, 47: 2}, ('Unnamed: 3_level_0', 'Unnamed: 3_level_1', 'Unnamed: 3_level_2', 'Hour'): {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 0, 25: 1, 26: 2, 27: 3, 28: 4, 29: 5, 30: 6, 31: 7, 32: 8, 33: 9, 34: 10, 35: 11, 36: 12, 37: 13, 38: 14, 39: 15, 40: 16, 41: 17, 42: 18, 43: 19, 44: 20, 45: 21, 46: 22, 47: 23}, ('USD', 'Cons', 'very high', 'Hub1'): {0: 23.06, 1: 21.49, 2: 21.73, 3: 21.58, 4: 21.67, 5: 22.78, 6: 27.15, 7: 26.09, 8: 26.23, 9: 28.21, 10: 29.21, 11: 31.97, 12: 30.45, 13: 30.45, 14: 30.45, 15: 29.14, 16: 
28.28, 17: 26.35, 18: 26.32, 19: 27.01, 20: 26.34, 21: 28.22, 22: 27.77, 23: 26.94, 24: 24.16, 25: 22.74, 26: 22.67, 27: 22.67, 28: 22.74, 29: 23.14, 30: 27.81, 31: 27.87, 32: 28.05, 33: 27.91, 34: 32.66, 35: 35.14, 36: 33.32, 37: 36.17, 38: 38.33, 39: 31.75, 40: 30.9, 41: 26.36, 42: 27.17, 43: 28.17, 44: 26.17, 45: 26.5, 46: 28.95, 47: 26.94}, ('EUR', 'Case', 'CWM', 'Hub2'): {0: 18.59, 1: 18.32, 2: 18.32, 3: 18.32, 4: 18.32, 5: 19.19, 6: 22.57, 7: 25.38, 8: 25.53, 9: 25.9, 10: 26.47, 11: 26.47, 12: 26.09, 13: 25.59, 14: 25.35, 15: 24.97, 16: 24.22, 17: 25.22, 18: 25.49, 19: 26.19, 20: 25.63, 21: 25.1, 22: 21.93, 23: 19.61, 24: 19.4, 25: 18.75, 26: 18.85, 27: 18.75, 28: 18.88, 29: 19.41, 30: 23.97, 31: 27.07, 32: 27.23, 33: 29.21, 34: 30.49, 35: 28.52, 36: 27.49, 37: 26.93, 38: 26.71, 39: 25.76, 40: 25.24, 41: 25.67, 42: 26.72, 43: 27.98, 44: 26.73, 45: 25.97, 46: 22.34, 47: 19.47}, ('USD', 'Cons', 'Ventyx', 'Hub3'): {0: 19.78, 1: 20.96, 2: 21.58, 3: 21.5, 4: 21.27, 5: 22.59, 6: 26.22, 7: 26.78, 8: 26.78, 9: 26.97, 10: 26.97, 11: 26.97, 12: 26.53, 13: 26.34, 14: 26.5, 15: 26.22, 16: 25.6, 17: 26.5, 18: 26.74, 19: 27.44, 20: 26.87, 21: 26.5, 22: 23.2, 23: 23.58, 24: 22.74, 25: 22.31, 26: 22.27, 27: 22.27, 28: 22.74, 29: 22.84, 30: 27.79, 31: 31.63, 32: 29.6, 33: 29.25, 34: 30.53, 35: 28.51, 36: 27.48, 37: 26.97, 38: 26.74, 39: 26.53, 40: 26.5, 41: 26.92, 42: 28.89, 43: 30.24, 44: 28.38, 45: 27.38, 46: 24.39, 47: 23.2}}
That is about as good a representation as I can make of this file. Columns 1-4 have a single header; columns 5-N (yes N, because we don't know how many) have 4 headers.
The dataframe needs to look like this:
+------+------+------+------+------+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 | NCol1| NCol2|NCol3 | NCol4| Col9 |
+------+------+------+------+------+------+------+------+------+
| V1 | V2 | V3 | V4 | USD | Case | High | Owner| V5 |
| V1a | V2a | V3a | V4a | USD | Case | High | Owner| V5a |
| V1a | V2a | V3a | V4a | EUR | Cons | Low | Ops | V6 |
| V1a | V2a | V3a | V4a | EUR | Cons | Low | Ops | V6a |
| V1a | V2a | V3a | V4a | JPY | Case | CWM | VH | V7 |
| V1a | V2a | V3a | V4a | JPY | Case | CWM | VH | V7a |
| V1a | V2a | V3a | V4a | RUP | Case | AEP | Delta| V8 |
| V1a | V2a | V3a | V4a | RUP | Case | AEP | Delta| V8a |
+------+------+------+------+------+------+------+------+------+
So essentially, pivot the 5th through Nth column headers into new columns, so that each row of data is aligned with the first 4 columns and with the headers its values were originally under.
I tried:
df = pd.read_csv(file,header=[0,1,2,3])
df.melt(var_name=['a','b','c','d'], value_name='e')
Also:
df2 = df.melt(id_vars=['Year','Month','Day','Hour'], col_level=3)
And :
df2 = df.stack().stack().stack().stack()
That last one is very close, but it also stacks the first 4 columns. However, that doesn't work, as it gives me just col1 and col2.
I feel like I am shooting in the dark, but this is what I could pull out. Let me know if it is not what you want; if it isn't, kindly post a sample output based on the dict you posted so others can butt in, and I'll gladly delete this hack.
df = pd.DataFrame(sample)
# flatten the 4-level header into single underscore-joined names
df.columns = df.columns.to_flat_index()
df.columns = ['_'.join(i) for i in df.columns]
df = df.melt(id_vars=['Unnamed: 0_level_0_Unnamed: 0_level_1_Unnamed: 0_level_2_Year',
                      'Unnamed: 1_level_0_Unnamed: 1_level_1_Unnamed: 1_level_2_Month',
                      'Unnamed: 2_level_0_Unnamed: 2_level_1_Unnamed: 2_level_2_Day',
                      'Unnamed: 3_level_0_Unnamed: 3_level_1_Unnamed: 3_level_2_Hour'])
# keep only the last piece of each flattened name, then split the
# 'variable' column back into the four original header levels
df.columns = [i.split('_')[-1] for i in df.columns]
pd.concat([df, df.variable.str.split('_', expand=True)], axis=1)
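For what it's worth, an alternative sketch (on a toy frame of the same shape, just two header levels instead of four) avoids flattening names entirely: move the id columns into the index, then stack all the header levels at once.

```python
import pandas as pd

# toy stand-in: 2 id columns plus value columns under a 2-level header
cols = pd.MultiIndex.from_tuples([
    ('Unnamed: 0_level_0', 'Year'),
    ('Unnamed: 1_level_0', 'Month'),
    ('USD', 'Hub1'),
    ('EUR', 'Hub2'),
])
df = pd.DataFrame([[2020, 1, 23.06, 18.59],
                   [2020, 1, 21.49, 18.32]], columns=cols)

# set the id columns as the index, then stack both header levels into rows
long = (df.set_index(list(df.columns[:2]))
          .stack([0, 1])
          .reset_index())
long.columns = ['Year', 'Month', 'currency', 'hub', 'value']
```

With four header levels the call becomes `.stack([0, 1, 2, 3])` and the renaming gains two more columns; the shape of the result is the same.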