How to average across two dataframes - python

I have two dataframes, df1 and df2 (shown below as df.to_dict() output):
{'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}}
{'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}}
For every 'id' that exists in both dataframes I would like to compute the average of their values in 'total' and have that in a new dataframe.
I tried:
pd.merge(df1, df2, on="id")
with the hope that I could then do:
merged_df[['total']].mean(axis=1)
but it doesn't work at all: after the merge, the shared 'total' column is suffixed into total_x and total_y, so there is no 'total' column left to select.
How can I do this?

You could use:
df1.merge(df2, on='id').set_index('id').mean(axis=1).reset_index(name='total')
Or, if you have many columns, a more generic approach:
(df1.merge(df2, on='id', suffixes=(None, '_other')).set_index('id')
    .rename(columns=lambda x: x.removesuffix('_other'))  # str.removesuffix requires Python 3.9+
    .groupby(axis=1, level=0)
    .mean().reset_index()
)
Output:
id total
0 1548638 17.0
1 1953603 38.0
2 1956216 59.0
3 1962245 40.0
4 1981386 40.0
5 1981773 40.0
6 2004787 80.0
7 2017418 44.0
8 2020989 51.0
9 2045043 46.0
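Note that groupby(axis=1) is deprecated since pandas 2.1. A sketch of an equivalent that stays supported: transpose so the grouping happens on the index, then transpose back:
(df1.merge(df2, on='id', suffixes=(None, '_other')).set_index('id')
    .rename(columns=lambda x: x.removesuffix('_other'))
    .T.groupby(level=0).mean().T  # replaces the deprecated groupby(axis=1, level=0)
    .reset_index()
)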

You can do it like this:
df1 = pd.DataFrame({'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}})
df2 = pd.DataFrame({'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}})
merged_df = df1.merge(df2, on='id')
merged_df['total_mean'] = merged_df.filter(regex='total').mean(axis=1)
print(merged_df)
Output:
id total_x total_y total_mean
0 1548638 17 17 17.0
1 1953603 38 38 38.0
2 1956216 59 59 59.0
3 1962245 40 40 40.0
4 1981386 40 40 40.0
5 1981773 40 40 40.0
6 2004787 80 80 80.0
7 2017418 44 44 44.0
8 2020989 51 51 51.0
9 2045043 46 46 46.0
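If you prefer to avoid merge entirely, a hedged alternative (variable names are illustrative) is to stack both frames and average per id, keeping only the ids present in both:
common_ids = set(df1['id']) & set(df2['id'])
stacked = pd.concat([df1, df2])
result = (stacked[stacked['id'].isin(common_ids)]
          .groupby('id', as_index=False)['total'].mean())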

Related

Double header dataframe, sumif (possibly groupby?) with python

So here is an image of what I have and what I want to get: https://imgur.com/a/RyDbvZD
Basically, those are SUMIF formulas in Excel that I would like to recreate in Python. I was trying with the pandas groupby().sum() function, but I have no clue how to group by two header rows like this, and then how to order the data.
Original dataframe:
df = pd.DataFrame( {'Group': {0: 'Name', 1: 20201001, 2: 20201002, 3: 20201003, 4: 20201004, 5: 20201005, 6: 20201006, 7: 20201007, 8: 20201008, 9: 20201009, 10: 20201010}, 'Credit': {0: 'Credit', 1: 65, 2: 69, 3: 92, 4: 18, 5: 58, 6: 12, 7: 31, 8: 29, 9: 12, 10: 41}, 'Equity': {0: 'Stock', 1: 92, 2: 62, 3: 54, 4: 52, 5: 14, 6: 5, 7: 14, 8: 17, 9: 54, 10: 51}, 'Equity.1': {0: 'Option', 1: 87, 2: 30, 3: 40, 4: 24, 5: 95, 6: 77, 7: 44, 8: 77, 9: 88, 10: 85}, 'Credit.1': {0: 'Credit', 1: 62, 2: 60, 3: 91, 4: 57, 5: 65, 6: 50, 7: 75, 8: 55, 9: 48, 10: 99}, 'Equity.2': {0: 'Option', 1: 61, 2: 91, 3: 38, 4: 3, 5: 71, 6: 51, 7: 74, 8: 41, 9: 59, 10: 31}, 'Bond': {0: 'Bond', 1: 4, 2: 62, 3: 91, 4: 66, 5: 30, 6: 51, 7: 76, 8: 6, 9: 65, 10: 73}, 'Unnamed: 7': {0: 'Stock', 1: 54, 2: 23, 3: 74, 4: 92, 5: 36, 6: 89, 7: 88, 8: 32, 9: 19, 10: 91}, 'Bond.1': {0: 'Bond', 1: 96, 2: 10, 3: 11, 4: 7, 5: 28, 6: 82, 7: 13, 8: 46, 9: 70, 10: 46}, 'Bond.2': {0: 'Bond', 1: 25, 2: 53, 3: 96, 4: 70, 5: 52, 6: 9, 7: 98, 8: 9, 9: 48, 10: 58}, 'Unnamed: 10': {0: float('nan'), 1: 63.0, 2: 80.0, 3: 17.0, 4: 21.0, 5: 30.0, 6: 78.0, 7: 23.0, 8: 31.0, 9: 72.0, 10: 65.0}} )
What I want at the end:
df = pd.DataFrame( {'Group': {0: 20201001, 1: 20201002, 2: 20201003, 3: 20201004, 4: 20201005, 5: 20201006, 6: 20201007, 7: 20201008, 8: 20201009, 9: 20201010}, 'Credit': {0: 127, 1: 129, 2: 183, 3: 75, 4: 123, 5: 62, 6: 106, 7: 84, 8: 60, 9: 140}, 'Equity': {0: 240, 1: 183, 2: 132, 3: 79, 4: 180, 5: 133, 6: 132, 7: 135, 8: 201, 9: 167}, 'Stock': {0: 146, 1: 85, 2: 128, 3: 144, 4: 50, 5: 94, 6: 102, 7: 49, 8: 73, 9: 142}, 'Option': {0: 148, 1: 121, 2: 78, 3: 27, 4: 166, 5: 128, 6: 118, 7: 118, 8: 147, 9: 116}} )
Any ideas on where to start with this would be appreciated.
Here you go. The first row seems to hold the real headers, so we first move it into the column names and set the index to Name:
df2 = df.rename(columns=df.loc[0]).drop(index=0).set_index(['Name'])
Then we group by column name and sum:
df2.groupby(df2.columns, axis=1, sort=False).sum().reset_index()
and we get
Name Credit Stock Option Bond
0 20201001 127.0 146.0 148.0 125.0
1 20201002 129.0 85.0 121.0 125.0
2 20201003 183.0 128.0 78.0 198.0
3 20201004 75.0 144.0 27.0 143.0
4 20201005 123.0 50.0 166.0 110.0
5 20201006 62.0 94.0 128.0 142.0
6 20201007 106.0 102.0 118.0 187.0
7 20201008 84.0 49.0 118.0 61.0
8 20201009 60.0 73.0 147.0 183.0
9 20201010 140.0 142.0 116.0 177.0
I realise the output is not exactly what you asked for, but since we cannot see your SUMIF formulas, I do not know which columns you want to aggregate.
Edit
Following up on your comment: as far as I can tell, the rules for aggregation are somewhat messy, in that the same column is included in more than one output column (like Equity.1). I do not think there is much you can do with automation here; you can replicate your SUMIF experience by directly referencing the columns you want to add. So I think the following gives you what you want:
df = df.drop(index=0)
df2 = df[['Group']].copy()
df2['Credit'] = df['Credit'] + df['Credit.1']
df2['Equity'] = df['Equity'] + df['Equity.1'] + df['Equity.2']
df2['Stock'] = df['Equity'] + df['Unnamed: 7']
df2['Option'] = df['Equity.1'] + df['Equity.2']
df2
produces
Group Credit Equity Stock Option
-- -------- -------- -------- ------- --------
1 20201001 127 240 146 148
2 20201002 129 183 85 121
3 20201003 183 132 128 78
4 20201004 75 79 144 27
5 20201005 123 180 50 166
6 20201006 62 133 94 128
7 20201007 106 132 102 118
8 20201008 84 135 49 118
9 20201009 60 201 73 147
10 20201010 140 167 142 116
This also gives you control over which columns to include in the final output.
If you want this more automated, then you need to do something about the labels of your columns, as you would want a unique label for each set of columns you want to aggregate. If the same input column is used in more than one calculation, it is probably easiest to just duplicate it with the right labels, as sketched below.
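For example, here is one hedged sketch of that labelling idea. The label_map below is hypothetical; it just encodes the manual additions above, with a column appearing under several labels when it feeds more than one output:
# Hypothetical mapping from raw column to the output label(s) it feeds.
label_map = {
    'Credit': ['Credit'], 'Credit.1': ['Credit'],
    'Equity': ['Equity', 'Stock'], 'Equity.1': ['Equity', 'Option'],
    'Equity.2': ['Equity', 'Option'], 'Unnamed: 7': ['Stock'],
}
body = df.drop(index=0)  # drop the embedded header row, as above
out = body[['Group']].copy()
for label in ['Credit', 'Equity', 'Stock', 'Option']:
    cols = [c for c in label_map if label in label_map[c]]
    out[label] = body[cols].sum(axis=1)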

Pandas Resampling with delta time after specific starting time

After reading a CSV into a data frame, I am trying to resample my "Value" column to 5 seconds, starting from the first rounded second of the time value. I would like to have the mean for all the values within the next 5 seconds, starting from 46:19.6 (format %M:%S.%f). So the code would give me the mean for 46:20, then 46:25, and so on... Does anybody know how to do this? Thank you!
input:
df = pd.DataFrame({'Time': {0: '46:19.6',
1: '46:20.7',
2: '46:21.8',
3: '46:22.9',
4: '46:24.0',
5: '46:25.1',
6: '46:26.2',
7: '46:27.6',
8: '46:28.7',
9: '46:29.8',
10: '46:30.9',
11: '46:32.0',
12: '46:33.2',
13: '46:34.3',
14: '46:35.3',
15: '46:36.5',
16: '46:38.8',
17: '46:40.0'},
'Value': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 8,
8: 9,
9: 10,
10: 11,
11: 12,
12: 13,
13: 14,
14: 15,
15: 17,
16: 19,
17: 20}})
Assuming your Time field is in datetime64[ns] format, you can simply use pd.Grouper and pass freq='5S':
# The next line is optional: convert to datetime format if the `Time` field is an `object`, i.e. string.
# df['Time'] = pd.to_datetime('00:' + df['Time'])
df1 = df.groupby(pd.Grouper(key='Time', freq='5S'))['Value'].mean().reset_index()
# Depending on what you want to do, you can also replace the line above with one of the two below:
# df1 = df.groupby(pd.Grouper(key='Time', freq='5S'))['Value'].mean().reset_index().iloc[1:]
# df1 = df.groupby(pd.Grouper(key='Time', freq='5S', base=4.6))['Value'].mean().reset_index()
# In the line above, base=4.6 can be adjusted to any number between 0 and 5.
df1
output:
Time Value
0 2020-07-07 00:46:15 0.0
1 2020-07-07 00:46:20 2.5
2 2020-07-07 00:46:25 7.6
3 2020-07-07 00:46:30 12.5
4 2020-07-07 00:46:35 17.0
5 2020-07-07 00:46:40 20.0
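Side note: the base argument used in the last commented line was deprecated in pandas 1.1 in favour of offset, so with a newer pandas the same idea would look something like:
df1 = df.groupby(pd.Grouper(key='Time', freq='5S', offset='4.6s'))['Value'].mean().reset_index()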
Full reproducible code from an example DataFrame I created:
import pandas as pd
df = pd.DataFrame({'Time': {0: '46:19.6',
1: '46:20.7',
2: '46:21.8',
3: '46:22.9',
4: '46:24.0',
5: '46:25.1',
6: '46:26.2',
7: '46:27.6',
8: '46:28.7',
9: '46:29.8',
10: '46:30.9',
11: '46:32.0',
12: '46:33.2',
13: '46:34.3',
14: '46:35.3',
15: '46:36.5',
16: '46:38.8',
17: '46:40.0'},
'Value': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 8,
8: 9,
9: 10,
10: 11,
11: 12,
12: 13,
13: 14,
14: 15,
15: 17,
16: 19,
17: 20}})
df['Time'] = pd.to_datetime('00:'+df['Time'])
df1 = df.groupby(pd.Grouper(key='Time', freq='5S'))['Value'].mean().reset_index()
df1

Is there a formulaic approach to find the frequency of the sum of combinations?

I have 5 strawberries, 2 lemons, and a banana. For each possible combination of these (including selecting 0), there is a total number of objects. I ultimately want a list of the frequencies at which these sums appear.
[1 strawberry, 0 lemons, 0 bananas] = 1 object
[2 strawberries, 0 lemons, 1 banana] = 3 objects
[0 strawberries, 1 lemon, 0 bananas] = 1 object
[2 strawberries, 1 lemon, 0 bananas] = 3 objects
[3 strawberries, 0 lemons, 0 bananas] = 3 objects
For just the above selection of 5 combinations, "1" has a frequency of 2 and "3" has a frequency of 3.
Obviously there are far more possible combinations, each changing the frequency result. Is there a formulaic way to approach the problem to find the frequencies for an entire set of combinations?
Currently, I've set up a brute-force function in Python.
import itertools

special_cards = {
    'A':7, 'B':1, 'C':1, 'D':1, 'E':1, 'F':1, 'G':1, 'H':1, 'I':1, 'J':1, 'K':1, 'L':1,
    'M':1, 'N':1, 'O':1, 'P':1, 'Q':1, 'R':1, 'S':1, 'T':1, 'U':1, 'V':1, 'W':1, 'X':1,
    'Y':1, 'Z':1, 'AA':1, 'AB':1, 'AC':1, 'AD':1, 'AE':1, 'AF':1, 'AG':1, 'AH':1, 'AI':1, 'AJ':1,
    'AK':1, 'AL':1, 'AM':1, 'AN':1, 'AO':1, 'AP':1, 'AQ':1, 'AR':1, 'AS':1, 'AT':1, 'AU':1, 'AV':1,
    'AW':1, 'AX':1, 'AY':1
}

def _calc_dis_specials(special_cards):
    """Calculate the total combinations when special cards are factored in"""
    # Create an iterator for special card combinations.
    special_paths = _gen_dis_special_list(special_cards)
    freq = {}
    path_count = 0
    for o_path in special_paths:  # Loop through the iterator
        path_count += 1  # Keep track of how many combinations we've evaluated thus far.
        try:  # I've been told I can use a collections.Counter() object instead of try/except.
            path_sum = sum(o_path)  # Sum the path (counting objects)
            new_count = freq[path_sum] + 1  # Try to increment the count for our sum.
            freq.update({path_sum: new_count})
        except KeyError:
            freq.update({path_sum: 1})
            print(f"{path_count:,}\n{freq}")  # Print whenever a new sum first appears.
    print(f"{path_count:,}\n{freq}")
    # Do things with results yadda yadda

def _gen_dis_special_list(special_cards):
    """Generates an iterator of all combinations for special cards"""
    product_args = []
    for value in special_cards.values():  # A card's "value" is the maximum number that can be in a deck.
        product_args.append(range(value + 1))  # Populates product_args with each card's possible counts.
    result = itertools.product(*product_args)
    return result
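As the comment above hints, the try/except tally can be replaced with collections.Counter; a minimal sketch of that variant (the function name is illustrative):
from collections import Counter

def _calc_dis_specials_counter(special_cards):
    """Same tally as above, using Counter instead of try/except."""
    return Counter(sum(o_path) for o_path in _gen_dis_special_list(special_cards))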
However, for large numbers of object pools (50+), the number of combinations just gets out of hand: billions upon billions. I need a formulaic approach.
Looking at some output, I notice a couple of things:
1
{0: 1}
2
{0: 1, 1: 1}
4
{0: 1, 1: 2, 2: 1}
8
{0: 1, 1: 3, 2: 3, 3: 1}
16
{0: 1, 1: 4, 2: 6, 3: 4, 4: 1}
32
{0: 1, 1: 5, 2: 10, 3: 10, 4: 5, 5: 1}
64
{0: 1, 1: 6, 2: 15, 3: 20, 4: 15, 5: 6, 6: 1}
128
{0: 1, 1: 7, 2: 21, 3: 35, 4: 35, 5: 21, 6: 7, 7: 1}
256
{0: 1, 1: 8, 2: 28, 3: 56, 4: 70, 5: 56, 6: 28, 7: 8, 8: 1}
512
{0: 1, 1: 9, 2: 36, 3: 84, 4: 126, 5: 126, 6: 84, 7: 36, 8: 9, 9: 1}
1,024
{0: 1, 1: 10, 2: 45, 3: 120, 4: 210, 5: 252, 6: 210, 7: 120, 8: 45, 9: 10, 10: 1}
2,048
{0: 1, 1: 11, 2: 55, 3: 165, 4: 330, 5: 462, 6: 462, 7: 330, 8: 165, 9: 55, 10: 11, 11: 1}
4,096
{0: 1, 1: 12, 2: 66, 3: 220, 4: 495, 5: 792, 6: 924, 7: 792, 8: 495, 9: 220, 10: 66, 11: 12, 12: 1}
8,192
{0: 1, 1: 13, 2: 78, 3: 286, 4: 715, 5: 1287, 6: 1716, 7: 1716, 8: 1287, 9: 715, 10: 286, 11: 78, 12: 13, 13: 1}
16,384
{0: 1, 1: 14, 2: 91, 3: 364, 4: 1001, 5: 2002, 6: 3003, 7: 3432, 8: 3003, 9: 2002, 10: 1001, 11: 364, 12: 91, 13: 14, 14: 1}
32,768
{0: 1, 1: 15, 2: 105, 3: 455, 4: 1365, 5: 3003, 6: 5005, 7: 6435, 8: 6435, 9: 5005, 10: 3003, 11: 1365, 12: 455, 13: 105, 14: 15, 15: 1}
65,536
{0: 1, 1: 16, 2: 120, 3: 560, 4: 1820, 5: 4368, 6: 8008, 7: 11440, 8: 12870, 9: 11440, 10: 8008, 11: 4368, 12: 1820, 13: 560, 14: 120, 15: 16, 16: 1}
131,072
{0: 1, 1: 17, 2: 136, 3: 680, 4: 2380, 5: 6188, 6: 12376, 7: 19448, 8: 24310, 9: 24310, 10: 19448, 11: 12376, 12: 6188, 13: 2380, 14: 680, 15: 136, 16: 17, 17: 1}
262,144
{0: 1, 1: 18, 2: 153, 3: 816, 4: 3060, 5: 8568, 6: 18564, 7: 31824, 8: 43758, 9: 48620, 10: 43758, 11: 31824, 12: 18564, 13: 8568, 14: 3060, 15: 816, 16: 153, 17: 18, 18: 1}
524,288
{0: 1, 1: 19, 2: 171, 3: 969, 4: 3876, 5: 11628, 6: 27132, 7: 50388, 8: 75582, 9: 92378, 10: 92378, 11: 75582, 12: 50388, 13: 27132, 14: 11628, 15: 3876, 16: 969, 17: 171, 18: 19, 19: 1}
1,048,576
{0: 1, 1: 20, 2: 190, 3: 1140, 4: 4845, 5: 15504, 6: 38760, 7: 77520, 8: 125970, 9: 167960, 10: 184756, 11: 167960, 12: 125970, 13: 77520, 14: 38760, 15: 15504, 16: 4845, 17: 1140, 18: 190, 19: 20, 20: 1}
2,097,152
{0: 1, 1: 21, 2: 210, 3: 1330, 4: 5985, 5: 20349, 6: 54264, 7: 116280, 8: 203490, 9: 293930, 10: 352716, 11: 352716, 12: 293930, 13: 203490, 14: 116280, 15: 54264, 16: 20349, 17: 5985, 18: 1330, 19: 210, 20: 21, 21: 1}
4,194,304
{0: 1, 1: 22, 2: 231, 3: 1540, 4: 7315, 5: 26334, 6: 74613, 7: 170544, 8: 319770, 9: 497420, 10: 646646, 11: 705432, 12: 646646, 13: 497420, 14: 319770, 15: 170544, 16: 74613, 17: 26334, 18: 7315, 19: 1540, 20: 231, 21: 22, 22: 1}
8,388,608
{0: 1, 1: 23, 2: 253, 3: 1771, 4: 8855, 5: 33649, 6: 100947, 7: 245157, 8: 490314, 9: 817190, 10: 1144066, 11: 1352078, 12: 1352078, 13: 1144066, 14: 817190, 15: 490314, 16: 245157, 17: 100947, 18: 33649, 19: 8855, 20: 1771, 21: 253, 22: 23, 23: 1}
16,777,216
{0: 1, 1: 24, 2: 276, 3: 2024, 4: 10626, 5: 42504, 6: 134596, 7: 346104, 8: 735471, 9: 1307504, 10: 1961256, 11: 2496144, 12: 2704156, 13: 2496144, 14: 1961256, 15: 1307504, 16: 735471, 17: 346104, 18: 134596, 19: 42504, 20: 10626, 21: 2024, 22: 276, 23: 24, 24: 1}
33,554,432
{0: 1, 1: 25, 2: 300, 3: 2300, 4: 12650, 5: 53130, 6: 177100, 7: 480700, 8: 1081575, 9: 2042975, 10: 3268760, 11: 4457400, 12: 5200300, 13: 5200300, 14: 4457400, 15: 3268760, 16: 2042975, 17: 1081575, 18: 480700, 19: 177100, 20: 53130, 21: 12650, 22: 2300, 23: 300, 24: 25, 25: 1}
67,108,864
{0: 1, 1: 26, 2: 325, 3: 2600, 4: 14950, 5: 65780, 6: 230230, 7: 657800, 8: 1562275, 9: 3124550, 10: 5311735, 11: 7726160, 12: 9657700, 13: 10400600, 14: 9657700, 15: 7726160, 16: 5311735, 17: 3124550, 18: 1562275, 19: 657800, 20: 230230, 21: 65780, 22: 14950, 23: 2600, 24: 325, 25: 26, 26: 1}
134,217,728
{0: 1, 1: 27, 2: 351, 3: 2925, 4: 17550, 5: 80730, 6: 296010, 7: 888030, 8: 2220075, 9: 4686825, 10: 8436285, 11: 13037895, 12: 17383860, 13: 20058300, 14: 20058300, 15: 17383860, 16: 13037895, 17: 8436285, 18: 4686825, 19: 2220075, 20: 888030, 21: 296010, 22: 80730, 23: 17550, 24: 2925, 25: 351, 26: 27, 27: 1}
268,435,456
{0: 1, 1: 28, 2: 378, 3: 3276, 4: 20475, 5: 98280, 6: 376740, 7: 1184040, 8: 3108105, 9: 6906900, 10: 13123110, 11: 21474180, 12: 30421755, 13: 37442160, 14: 40116600, 15: 37442160, 16: 30421755, 17: 21474180, 18: 13123110, 19: 6906900, 20: 3108105, 21: 1184040, 22: 376740, 23: 98280, 24: 20475, 25: 3276, 26: 378, 27: 28, 28: 1}
536,870,912
{0: 1, 1: 29, 2: 406, 3: 3654, 4: 23751, 5: 118755, 6: 475020, 7: 1560780, 8: 4292145, 9: 10015005, 10: 20030010, 11: 34597290, 12: 51895935, 13: 67863915, 14: 77558760, 15: 77558760, 16: 67863915, 17: 51895935, 18: 34597290, 19: 20030010, 20: 10015005, 21: 4292145, 22: 1560780, 23: 475020, 24: 118755, 25: 23751, 26: 3654, 27: 406, 28: 29, 29: 1}
1,073,741,824
{0: 1, 1: 30, 2: 435, 3: 4060, 4: 27405, 5: 142506, 6: 593775, 7: 2035800, 8: 5852925, 9: 14307150, 10: 30045015, 11: 54627300, 12: 86493225, 13: 119759850, 14: 145422675, 15: 155117520, 16: 145422675, 17: 119759850, 18: 86493225, 19: 54627300, 20: 30045015, 21: 14307150, 22: 5852925, 23: 2035800, 24: 593775, 25: 142506, 26: 27405, 27: 4060, 28: 435, 29: 30, 30: 1}
Note that I'm only printing when a new key (sum) is found.
I notice that a new sum is found only at powers of 2, and that the results are symmetrical.
This hints to me that there's a formulaic approach that could work.
Any ideas on how to proceed?
Good news: there is a formula for this, and I'll explain the path to it in case there is any confusion.
Let's look at your initial example: 5 strawberries (S), 2 lemons (L), and a banana (B). Let's lay out all of the fruits:
S S S S S L L B
We can now rephrase the question: the number of times that 3, for example, appears as the total is the number of different ways you can pick 3 of the fruits from this list.
In statistics, the choose function (a.k.a. nCk) answers exactly this question: how many ways are there to select a group of k items from a group of n items? It is computed as n!/((n-k)!*k!), where "!" is the factorial: a number multiplied by all positive integers less than itself. As such, the frequency of 3s is (the number of fruits) "choose" (the total in question), or 8 choose 3. This is 8!/(5!*3!) = 56.
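As a quick check against the printed output, a minimal sketch with Python's math.comb (available from Python 3.8):
import math

n = 8  # 5 strawberries + 2 lemons + 1 banana, laid out as individual items
freq = {k: math.comb(n, k) for k in range(n + 1)}
print(freq[3])  # 56, i.e. 8 choose 3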

How can I best create a new df using index values from another df that are used to retrieve multiple values?

The nn_idx_df contains index values that match the index of xyz_df. How can I get the values from column H in xyz_df and create new columns in nn_idx_df to match the result illustrated in output_df. I could hack my way through this, but would like to see a pandorable solution.
nn_idx_df = pd.DataFrame({'nn_1_idx': {0: 65, 1: 7, 2: 18},
'nn_2_idx': {0: 64, 1: 9, 2: 64},
'nn_3_idx': {0: 69, 1: 67, 2: 68},
'nn_4_idx': {0: 75, 1: 13, 2: 65},
'nn_5_idx': {0: 70, 1: 66, 2: 1}})
print(nn_idx_df)
nn_1_idx nn_2_idx nn_3_idx nn_4_idx nn_5_idx
0 65 64 69 75 70
1 7 9 67 13 66
2 18 64 68 65 1
xyz_df = pd.DataFrame({'X': {1: 6401652.35,
7: 6401845.46,
9: 6401671.93,
13: 6401868.98,
18: 6401889.78,
64: 6401725.71,
65: 6401663.04,
66: 6401655.89,
67: 6401726.33,
68: 6401755.92,
69: 6401755.23,
70: 6401766.23,
75: 6401825.9},
'Y': {1: 1858548.15,
7: 1858375.68,
9: 1858490.83,
13: 1858403.79,
18: 1858423.25,
64: 1858579.25,
65: 1858570.3,
66: 1858569.97,
67: 1858607.8,
68: 1858581.58,
69: 1858591.46,
70: 1858517.48,
75: 1858420.72},
'Z': {1: 467.62,
7: 482.22,
9: 459.15,
13: 485.17,
18: 488.35,
64: 488.88,
65: 465.75,
66: 467.35,
67: 486.12,
68: 490.12,
69: 490.68,
70: 483.96,
75: 467.39},
'H': {1: 47.8791,
7: 45.5502,
9: 46.0995,
13: 41.9554,
18: 41.0537,
64: 47.1215,
65: 46.0047,
66: 45.936,
67: 40.5807,
68: 37.8478,
69: 37.1639,
70: 37.2314,
75: 25.8446}})
print(xyz_df)
X Y Z H
1 6401652.35 1858548.15 467.62 47.8791
7 6401845.46 1858375.68 482.22 45.5502
9 6401671.93 1858490.83 459.15 46.0995
13 6401868.98 1858403.79 485.17 41.9554
18 6401889.78 1858423.25 488.35 41.0537
64 6401725.71 1858579.25 488.88 47.1215
65 6401663.04 1858570.30 465.75 46.0047
66 6401655.89 1858569.97 467.35 45.9360
67 6401726.33 1858607.80 486.12 40.5807
68 6401755.92 1858581.58 490.12 37.8478
69 6401755.23 1858591.46 490.68 37.1639
70 6401766.23 1858517.48 483.96 37.2314
75 6401825.90 1858420.72 467.39 25.8446
output_df = pd.DataFrame(
{'nn_1_idx': {0: 65, 1: 7, 2: 18},
'nn_2_idx': {0: 64, 1: 9, 2: 64},
'nn_3_idx': {0: 69, 1: 67, 2: 68},
'nn_4_idx': {0: 75, 1: 13, 2: 65},
'nn_5_idx': {0: 70, 1: 66, 2: 1},
'nn_1_idx_h': {0: 46.0047, 1: 45.5502, 2: 41.0537},
'nn_2_idx_h': {0: 47.1215, 1: 46.0995, 2: 47.1215},
'nn_3_idx_h': {0: 37.1639, 1:40.5807, 2: 37.8478},
'nn_4_idx_h': {0: 25.8446, 1: 41.9554, 2: 46.0047},
'nn_5_idx_h': {0: 37.2314, 1: 45.9360, 2: 47.8791}})
print(output_df)
nn_1_idx nn_2_idx nn_3_idx nn_4_idx nn_5_idx nn_1_idx_h nn_2_idx_h nn_3_idx_h nn_4_idx_h nn_5_idx_h
0 65 64 69 75 70 46.0047 47.1215 37.1639 25.8446 37.2314
1 7 9 67 13 66 45.5502 46.0995 40.5807 41.9554 45.9360
2 18 64 68 65 1 41.0537 47.1215 37.8478 46.0047 47.8791
Let us do replace with join:
df = nn_idx_df.join(nn_idx_df.replace(xyz_df.H).add_suffix('_h'))
df
nn_1_idx nn_2_idx nn_3_idx ... nn_3_idx_h nn_4_idx_h nn_5_idx_h
0 65 64 69 ... 37.1639 25.8446 37.2314
1 7 9 67 ... 40.5807 41.9554 45.9360
2 18 64 68 ... 37.8478 46.0047 47.8791
[3 rows x 10 columns]
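One caveat with replace: values without a match are left untouched, whereas map makes missing lookups explicit as NaN. A hedged equivalent with an explicit lookup dict (h_map is an illustrative name):
h_map = xyz_df['H'].to_dict()
df = nn_idx_df.join(nn_idx_df.apply(lambda col: col.map(h_map)).add_suffix('_h'))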

Deleting rows that have the same datetime

How can I keep only the first row within each minute? The seconds value does not matter. It seems rows could be deleted with something like df.drop(index=2), but there are too many rows to delete them one by one.
import json
import math
from pandas.io.json import json_normalize
import pandas as pd
a = open(r'C:\work\kenkyuu\FITBIT\MyFitbitData (4)\AswadMdnor\user-site-export\heart_rate-2019-11-17.json')
b=json.load(a)
df = json_normalize(b)
df = df.rename(columns={'value.bpm':'bpm','value.confidence':'confidence'})
print(df)
dateTime bpm confidence
11/17/19 02:28:05 113 0
11/17/19 02:28:17 70 0
11/17/19 02:28:31 70 0
11/17/19 02:28:42 70 0
11/17/19 02:29:29 70 0
11/17/19 02:29:46 70 0
11/17/19 02:30:43 70 0
11/17/19 02:32:13 70 0
11/17/19 02:49:39 70 0
I hope for this output:
dateTime bpm confidence
11/17/19 02:28:05 113 0
11/17/19 02:29:29 70 0
11/17/19 02:30:43 70 0
11/17/19 02:32:13 70 0
11/17/19 02:49:39 70 0
Here is the data as a dictionary, which you can use to recreate the DataFrame:
{'dateTime': {0: '11/17/19 02:28:05', 1: '11/17/19 02:28:17', 2: '11/17/19 02:28:31', 3: '11/17/19 02:28:42', 4: '11/17/19 02:29:29', 5: '11/17/19 02:29:46', 6: '11/17/19 02:30:43', 7: '11/17/19 02:32:13', 8: '11/17/19 02:49:39', 9: '11/17/19 02:49:49', 10: '11/17/19 02:49:54', 11: '11/17/19 02:49:59', 12: '11/17/19 02:50:04', 13: '11/17/19 02:50:09', 14: '11/17/19 02:50:14', 15: '11/17/19 02:50:24', 16: '11/17/19 02:50:29', 17: '11/17/19 02:50:34', 18: '11/17/19 02:50:39', 19: '11/17/19 02:50:44', 20: '11/17/19 02:50:49', 21: '11/17/19 02:51:04', 22: '11/17/19 02:51:09', 23: '11/17/19 03:04:05', 24: '11/17/19 03:04:33', 25: '11/17/19 11:14:27', 26: '11/17/19 11:14:42', 27: '11/17/19 11:14:52', 28: '11/17/19 11:15:01', 29: '11/17/19 11:15:06', 30: '11/17/19 11:15:21'}, 'bpm': {0: 113, 1: 70, 2: 70, 3: 70, 4: 70, 5: 70, 6: 70, 7: 70, 8: 70, 9: 67, 10: 62, 11: 57, 12: 58, 13: 60, 14: 60, 15: 62, 16: 63, 17: 65, 18: 66, 19: 67, 20: 65, 21: 66, 22: 67, 23: 69, 24: 70, 25: 70, 26: 70, 27: 70, 28: 70, 29: 70, 30: 70}, 'confidence': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 1, 10: 1, 11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1, 21: 1, 22: 1, 23: 0, 24: 0, 25: 0, 26: 0, 27: 1, 28: 1, 29: 0, 30: 1}}
I would floor the timestamps to the minute (assuming dateTime has been converted with pd.to_datetime), then drop every row whose floored datetime is a duplicate. Flooring rather than rounding matters here: rounding would push e.g. 02:28:31 up into the 02:29 bucket, which would not match the desired output.
df[~df['dateTime'].dt.floor('min').duplicated()]
Here, we drop the duplicates while ignoring the seconds, then take the surviving index values to get the original times with seconds, as shown below.
>>> df.iloc[df['dateTime'].astype(str).str[:-2].drop_duplicates(keep='first').index,:]
Output:
dateTime bpm confidence
0 11/17/19 02:28:05 113 0
4 11/17/19 02:29:29 70 0
6 11/17/19 02:30:43 70 0
7 11/17/19 02:32:13 70 0
8 11/17/19 02:49:39 70 0
I believe this solution is the most idiomatic one, although I will keep searching.
import numpy as np  # needed for the np.int64 dtypes below
import pandas as pd
df = pd.read_csv('../resources/fitbit_time_data.csv', dtype={'bpm': np.int64, 'confidence': np.int64}, parse_dates=['date_time'], names=['date_time', 'bpm', 'confidence'], skiprows=[0])
df = df.resample(rule='min', on='date_time').first().dropna().reset_index(drop=True)
Result:
date_time bpm confidence
0 2019-11-17 02:28:05 113.0 0.0
1 2019-11-17 02:29:29 70.0 0.0
2 2019-11-17 02:30:43 70.0 0.0
3 2019-11-17 02:32:13 70.0 0.0
4 2019-11-17 02:49:39 70.0 0.0
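The bpm and confidence values come back as floats because the empty minute bins (e.g. between 02:33 and 02:48) are NaN before dropna(); if you want the original integer dtypes back, a small follow-up sketch:
df = df.astype({'bpm': 'int64', 'confidence': 'int64'})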
import numpy as np  # needed for the np.int64 dtypes below
import pandas as pd
df = pd.read_csv('../resources/fitbit_time_data.csv', dtype={'bpm': np.int64, 'confidence': np.int64}, parse_dates=['date_time'], names=['date_time', 'bpm', 'confidence'], skiprows=[0])
df['minute'] = df.set_index('date_time').index.minute
df = df.loc[df['minute'].shift() != df['minute']]
DataFrame result:
date_time bpm confidence minute
0 2019-11-17 02:28:05 113 0 28
4 2019-11-17 02:29:29 70 0 29
6 2019-11-17 02:30:43 70 0 30
7 2019-11-17 02:32:13 70 0 32
8 2019-11-17 02:49:39 70 0 49
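One caveat on the shift trick: it compares only the minute number, so two adjacent rows from the same minute of different hours (say 02:50:49 followed by 03:50:10) would wrongly be treated as duplicates. A hedged, safer variant compares the full timestamp floored to the minute:
key = df['date_time'].dt.floor('min')
df = df.loc[key.shift() != key]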
