Getting combinations of elements from different pandas rows - python
Assume that I have a dataframe like this:
Date Artist percent_gray percent_blue percent_black percent_red
33 Leonardo 22 33 36 46
45 Leonardo 23 47 23 14
46 Leonardo 13 34 33 12
23 Michelangelo 28 19 38 25
25 Michelangelo 24 56 55 13
26 Michelangelo 21 22 45 13
13 Titian 24 17 23 22
16 Titian 45 43 44 13
19 Titian 17 45 56 13
24 Raphael 34 34 34 45
27 Raphael 31 22 25 67
I want to get the maximum color difference between different pictures by the same artist. Comparisons may mix columns, e.g. for Leonardo the biggest difference is percent_red (date: 46) - percent_blue (date: 45) = 12 - 47 = -35. I want to see how this evolves over time, so each picture should only be compared with the same artist's older pictures (in this case the third picture can be compared with the first and second ones, and the second picture only with the first one), keeping the maximum difference. So the dataframe should look like:
Date Artist max_d
33 Leonardo NaN
45 Leonardo -32
46 Leonardo -35
23 Michelangelo NaN
25 Michelangelo 37
26 Michelangelo -43
13 Titian NaN
16 Titian 28
19 Titian 43
24 Raphael NaN
27 Raphael 33
I think I have to use groupby, but I couldn't manage to get the output I want.
You can use:
import numpy as np
import pandas as pd

#first sort (necessary for real data)
df = df.sort_values(['Artist', 'Date'])
#row-wise min and max across the percent_ columns
mi = df.iloc[:,2:].min(axis=1)
ma = df.iloc[:,2:].max(axis=1)
#min and max of the previous picture per artist
ma1 = ma.groupby(df['Artist']).shift()
mi1 = mi.groupby(df['Artist']).shift()
#candidate differences; keep the one with the larger absolute value
mad1 = mi - ma1
mad2 = ma - mi1
df['max_d'] = np.where(mad1.abs() > mad2.abs(), mad1, mad2)
print (df)
Date Artist percent_gray percent_blue percent_black \
0 33 Leonardo 22 33 36
1 45 Leonardo 23 47 23
2 46 Leonardo 13 34 33
3 23 Michelangelo 28 19 38
4 25 Michelangelo 24 56 55
5 26 Michelangelo 21 22 45
6 13 Titian 24 17 23
7 16 Titian 45 43 44
8 19 Titian 17 45 56
9 24 Raphael 34 34 34
10 27 Raphael 31 22 25
percent_red max_d
0 46 NaN
1 14 -32.0
2 12 -35.0
3 25 NaN
4 13 37.0
5 13 -43.0
6 22 NaN
7 13 28.0
8 13 43.0
9 45 NaN
10 67 33.0
Explanation (with new columns):
#get min and max per row
df['min'] = df.iloc[:,2:].min(axis=1)
df['max'] = df.iloc[:,2:].max(axis=1)
#get shifted min and max by Artist
df['max1'] = df.groupby('Artist')['max'].shift()
df['min1'] = df.groupby('Artist')['min'].shift()
#get differences
df['max_d1'] = df['min'] - df['max1']
df['max_d2'] = df['max'] - df['min1']
#keep whichever difference has the larger absolute value
df['max_d'] = np.where(df['max_d1'].abs() > df['max_d2'].abs(), df['max_d1'], df['max_d2'])
print (df)
percent_red min max max1 min1 max_d1 max_d2 max_d
0 46 22 46 NaN NaN NaN NaN NaN
1 14 14 47 46.0 22.0 -32.0 25.0 -32.0
2 12 12 34 47.0 14.0 -35.0 20.0 -35.0
3 25 19 38 NaN NaN NaN NaN NaN
4 13 13 56 38.0 19.0 -25.0 37.0 37.0
5 13 13 45 56.0 13.0 -43.0 32.0 -43.0
6 22 17 24 NaN NaN NaN NaN NaN
7 13 13 45 24.0 17.0 -11.0 28.0 28.0
8 13 13 56 45.0 13.0 -32.0 43.0 43.0
9 45 34 45 NaN NaN NaN NaN NaN
10 67 22 67 45.0 34.0 -23.0 33.0 33.0
If you use this second, explanatory solution, remove the helper columns afterwards:
df = df.drop(['min','max','max1','min1','max_d1', 'max_d2'], axis=1)
print (df)
Date Artist percent_gray percent_blue percent_black \
0 33 Leonardo 22 33 36
1 45 Leonardo 23 47 23
2 46 Leonardo 13 34 33
3 23 Michelangelo 28 19 38
4 25 Michelangelo 24 56 55
5 26 Michelangelo 21 22 45
6 13 Titian 24 17 23
7 16 Titian 45 43 44
8 19 Titian 17 45 56
9 24 Raphael 34 34 34
10 27 Raphael 31 22 25
percent_red max_d
0 46 NaN
1 14 -32.0
2 12 -35.0
3 25 NaN
4 13 37.0
5 13 -43.0
6 22 NaN
7 13 28.0
8 13 43.0
9 45 NaN
10 67 33.0
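Note that shift() compares each picture only with the picture immediately before it. If "compare new pictures with the old ones" is meant as comparing each picture with all of the same artist's earlier pictures, here is a sketch of a variant using per-group running extremes (not from the original answer; max_d_all is a made-up column name, and filter(like='percent_') is used so that already-added helper columns are skipped):

import numpy as np

df = df.sort_values(['Artist', 'Date'])
colors = df.filter(like='percent_')
mi = colors.min(axis=1)
ma = colors.max(axis=1)
#running min/max over all previous pictures of the same artist
prev_max = ma.groupby(df['Artist']).cummax().groupby(df['Artist']).shift()
prev_min = mi.groupby(df['Artist']).cummin().groupby(df['Artist']).shift()
d1 = mi - prev_max
d2 = ma - prev_min
#again keep the signed difference with the larger magnitude
df['max_d_all'] = np.where(d1.abs() > d2.abs(), d1, d2)

On this particular sample both approaches give the same max_d values, since each picture's extremes are dominated by its immediate predecessor.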
How about a custom apply function? Does this work?
import itertools

import pandas

p = pandas.read_csv('Artists.tsv', sep=r'\s+')

def max_any_color(cols):
    grey = []
    blue = []
    black = []
    red = []
    for _, row in cols.iterrows():
        grey.append(row['percent_gray'])
        blue.append(row['percent_blue'])
        black.append(row['percent_black'])
        red.append(row['percent_red'])
    #largest absolute difference for every cross-color pairing
    gb = max([abs(a[0] - a[1]) for a in itertools.product(grey, blue)])
    gblack = max([abs(a[0] - a[1]) for a in itertools.product(grey, black)])
    gr = max([abs(a[0] - a[1]) for a in itertools.product(grey, red)])
    bb = max([abs(a[0] - a[1]) for a in itertools.product(blue, black)])
    br = max([abs(a[0] - a[1]) for a in itertools.product(blue, red)])
    blackr = max([abs(a[0] - a[1]) for a in itertools.product(black, red)])
    l = [gb, gblack, gr, bb, br, blackr]
    c = ['grey/blue', 'grey/black', 'grey/red', 'blue/black', 'blue/red', 'black/red']
    max_ = max(l)
    between_colors_index = l.index(max_)
    return c[between_colors_index], max_

p.groupby('Artist').apply(lambda x: max_any_color(x))
Output:
Leonardo (blue/red, 35)
Michelangelo (blue/red, 43)
Raphael (blue/red, 45)
Titian (black/red, 43)
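If you'd rather not write the six pairings out by hand, the same result can be obtained by generating them with itertools.combinations. A sketch of an equivalent function, under the same column names as above (the returned labels will use the full column names, e.g. percent_gray/percent_blue rather than grey/blue):

import itertools

def max_any_color(cols):
    names = ['percent_gray', 'percent_blue', 'percent_black', 'percent_red']
    #largest absolute difference for every unordered cross-color pair
    diffs = {'%s/%s' % (a, b): max(abs(x - y) for x in cols[a] for y in cols[b])
             for a, b in itertools.combinations(names, 2)}
    pair = max(diffs, key=diffs.get)
    return pair, diffs[pair]

p.groupby('Artist').apply(max_any_color)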
Related
How to perform operations with columns from different datasets with different indexation?
The goal

A bit of background, to get familiar with the variables and understand what the problem is: floor, square, matc and volume are tables or dataframes that all share the column "id" (which simply goes from 1 to 100), so every row is unique; floor and square also share the column "room_name". volume is generally equivalent to floor, except that all rows whose rooms ("room_name") have no value in the "square" column of the square dataframe were dropped; this implies that some values of "id" are missing.

That done, I needed to create a new column in the volume dataframe consisting of the multiplication of one of its own columns with two other columns from the matc and square dataframes.

The problem

This seemingly simple interaction turned out to be quite difficult, because the columns I am working with are of different lengths (except for square and matc, which are the same) and I need to align them by "id". To make matters worse, when called directly as volume['coefLoosening'] (please note that coefLoosening does not originate from floor and is added after the table is created), it returns a series with its own index and no way to relate it to "id".

What I tried

Whilst trying to solve the issue, I came up with this abomination:

volume = volume.merge(pd.DataFrame({"id": matc.loc[matc["id"].isin(volume["id"])]["id"],
                                    "tempCoef": volume['coefLoosening']
                                                * matc.loc[matc["id"].isin(volume["id"])]['width']
                                                * square.loc[square["id"].isin(volume["id"])]['square']}),
                      how = "left", on = ["id"])

This, however, misaligns the "id" column completely, somehow creating more rows. For instance, this is what the merge returns:

index id tempCoef 0 1.0 960.430612244898 1 2.0 4665.499999999999 2 NaN NaN 3 4.0 2425.44652173913 4 5.0 5764.964210526316 5 6.0 55201.68727272727 6 NaN NaN 7 NaN NaN 8 NaN NaN 9 10.0 1780.7208791208789 10 11.0 6075.385074626865 11 12.0 10400.94 12 13.0 31.378285714285713 13 NaN NaN 14 NaN NaN 15 NaN NaN 16 17.0 10505.431451612903 17 18.0 1208.994845360825 18 NaN NaN 19 NaN NaN 20 21.0 568.8900000000001 21 22.0 4275.416470588235 22 NaN NaN 23 NaN NaN 24 25.0 547.04 25 26.0 2090.666111111111 26 27.0 2096.88406779661 27 NaN NaN 28 29.0 8324.566547619048 29 NaN NaN 30 NaN NaN 31 NaN NaN 32 33.0 2459.8314736842103 33 34.0 2177.778461538461 34 35.0 166.1257142857143 35 36.0 1866.8492307692304 36 37.0 3598.1470588235293 37 38.0 21821.709411764703 38 NaN NaN 39 40.0 2999.248 40 41.0 980.3136 41 42.0 2641.3503947368426 42 NaN NaN 43 44.0 25829.878148148146 44 45.0 649.3632 45 46.0 10895.386666666667 46 NaN NaN 47 NaN NaN 48 49.0 825.9879310344828 49 50.0 15951.941666666671 50 51.0 2614.9343434343436 51 52.0 2462.30625 52 NaN NaN 53 NaN NaN 54 55.0 1366.8287671232877 55 56.0 307.38 56 57.0 11601.975 57 58.0 1002.5415730337081 58 59.0 2493.4532432432434 59 60.0 981.7482608695652 61 62.0 NaN 63 64.0 NaN 65 66.0 NaN 67 68.0 NaN 73 74.0 NaN 75 76.0 NaN 76 77.0 NaN 77 78.0 NaN 78 79.0 NaN 80 81.0 NaN 82 83.0 NaN 84 85.0 NaN 88 89.0 NaN 89 90.0 NaN 90 91.0 NaN 92 93.0 NaN 94 95.0 NaN 95 96.0 NaN 97 98.0 NaN 98 99.0 NaN 99 100.0 NaN

For clarity, no values in any of the columns used in the operation have NaNs in them.
This is what volume["coefLoosening"] returns:

0 1.020408 1 1.515152 2 2.000000 3 4.347826 4 5.263158 5 9.090909 6 1.162791 7 1.149425 8 1.851852 9 1.098901 10 1.492537 11 2.083333 12 1.428571 13 1.010101 14 1.562500 15 3.448276 16 1.612903 17 1.030928 18 33.333333 19 1.000000 20 1.123596 21 1.960784 22 2.127660 23 2.857143 24 1.369863 25 1.111111 26 1.694915 27 1.492537 28 1.190476 29 1.818182 30 1.612903 31 12.500000 32 1.052632 33 3.846154 34 2.040816 35 1.098901 36 2.941176 37 2.941176 38 2.857143 39 1.111111 40 1.333333 41 1.315789 42 3.703704 43 3.703704 44 2.000000 45 33.333333 46 12.500000 47 1.149425 48 1.724138 49 4.166667 50 1.010101 51 1.041667 52 1.162791 53 3.225806 54 1.369863 55 1.666667 56 4.545455 57 1.123596 58 1.351351 59 2.173913

and finally, this is what volume["id"] returns (to compare to the result of the «abomination»):

0 1 1 2 2 4 3 5 4 6 5 10 6 11 7 12 8 13 9 17 10 18 11 21 12 22 13 25 14 26 15 27 16 29 17 33 18 34 19 35 20 36 21 37 22 38 23 40 24 41 25 42 26 44 27 45 28 46 29 49 30 50 31 51 32 52 33 55 34 56 35 57 36 58 37 59 38 60 39 62 40 64 41 66 42 68 43 74 44 76 45 77 46 78 47 79 48 81 49 83 50 85 51 89 52 90 53 91 54 93 55 95 56 96 57 98 58 99 59 100

Some thoughts

I believe part of the problem is how pandas returns columns (as series with their own default index) and I don't know how to work around that. Another source of the problem might be how the .loc[] method returns its result. In the case of matc.loc[matc["id"].isin(volume["id"])]['width'] it is:

0 15.98 1 36.12 3 32.19 4 18.54 5 98.96 9 64.56 10 58.20 11 55.08 12 3.84 16 77.31 17 15.25 20 63.21 21 76.32 24 10.52 25 54.65 26 95.46 28 79.67 32 57.01 33 27.54 34 7.36 35 36.44 36 23.64 37 78.98 39 92.19 40 31.26 41 61.71 43 70.07 44 10.91 45 4.24 48 7.35 49 46.70 50 97.69 51 32.03 54 13.50 55 42.30 56 94.71 57 37.49 58 57.86 59 50.29 61 18.18 63 88.26 65 4.28 67 28.89 73 4.05 75 22.37 76 52.20 77 98.29 78 72.98 80 6.07 82 35.80 84 64.16 88 23.60 89 45.05 90 21.14 92 31.21 94 46.04 95 7.15 97 27.70 98 31.93 99 79.62

which is shifted by -1 and I don't see a way to change this manually. So, any ideas? Maybe there is an answered analogue of this question (I tried to search for one before asking, but found nothing)?
Data Minimal columns of tables required to replicate this (because stack overflow does not allow files to be uploaded) volume: index,id,room_name,coefLoosening 0,1,6,1.0204081632653061 1,2,7,1.5151515151515151 2,4,3,2.0 3,5,7,4.3478260869565215 4,6,4,5.2631578947368425 5,10,7,9.090909090909092 6,11,5,1.1627906976744187 7,12,4,1.1494252873563218 8,13,1,1.8518518518518516 9,17,3,1.0989010989010988 10,18,3,1.4925373134328357 11,21,3,2.0833333333333335 12,22,7,1.4285714285714286 13,25,3,1.0101010101010102 14,26,6,1.5625 15,27,6,3.4482758620689657 16,29,4,1.6129032258064517 17,33,2,1.0309278350515465 18,34,2,33.333333333333336 19,35,5,1.0 20,36,4,1.1235955056179776 21,37,2,1.9607843137254901 22,38,6,2.127659574468085 23,40,5,2.857142857142857 24,41,6,1.36986301369863 25,42,3,1.1111111111111112 26,44,2,1.6949152542372883 27,45,4,1.4925373134328357 28,46,2,1.1904761904761905 29,49,5,1.8181818181818181 30,50,4,1.6129032258064517 31,51,2,12.5 32,52,3,1.0526315789473684 33,55,6,3.846153846153846 34,56,5,2.0408163265306123 35,57,5,1.0989010989010988 36,58,4,2.941176470588235 37,59,5,2.941176470588235 38,60,5,2.857142857142857 39,62,7,1.1111111111111112 40,64,7,1.3333333333333333 41,66,7,1.3157894736842106 42,68,3,3.7037037037037033 43,74,5,3.7037037037037033 44,76,4,2.0 45,77,3,33.333333333333336 46,78,4,12.5 47,79,5,1.1494252873563218 48,81,5,1.7241379310344829 49,83,4,4.166666666666667 50,85,2,1.0101010101010102 51,89,4,1.0416666666666667 52,90,1,1.1627906976744187 53,91,2,3.2258064516129035 54,93,2,1.36986301369863 55,95,1,1.6666666666666667 56,96,4,4.545454545454546 57,98,7,1.1235955056179776 58,99,7,1.3513513513513513 59,100,5,2.1739130434782608 matc: index,id,width 0,1,15.98 1,2,36.12 2,3,63.41 3,4,32.19 4,5,18.54 5,6,98.96 6,7,5.65 7,8,97.42 8,9,50.88 9,10,64.56 10,11,58.2 11,12,55.08 12,13,3.84 13,14,75.87 14,15,96.51 15,16,42.08 16,17,77.31 17,18,15.25 18,19,81.43 19,20,98.71 20,21,63.21 21,22,76.32 22,23,22.59 23,24,30.79 24,25,10.52 25,26,54.65 26,27,95.46 27,28,49.93 28,29,79.67 29,30,45.0 30,31,59.14 31,32,62.25 32,33,57.01 33,34,27.54 34,35,7.36 35,36,36.44 36,37,23.64 37,38,78.98 38,39,47.8 39,40,92.19 40,41,31.26 41,42,61.71 42,43,93.11 43,44,70.07 44,45,10.91 45,46,4.24 46,47,35.39 47,48,99.1 48,49,7.35 49,50,46.7 50,51,97.69 51,52,32.03 52,53,48.61 53,54,33.44 54,55,13.5 55,56,42.3 56,57,94.71 57,58,37.49 58,59,57.86 59,60,50.29 60,61,77.98 61,62,18.18 62,63,3.42 63,64,88.26 64,65,48.66 65,66,4.28 66,67,20.78 67,68,28.89 68,69,27.17 69,70,57.48 70,71,59.07 71,72,12.63 72,73,22.06 73,74,4.05 74,75,22.3 75,76,22.37 76,77,52.2 77,78,98.29 78,79,72.98 79,80,49.37 80,81,6.07 81,82,28.85 82,83,35.8 83,84,66.74 84,85,64.16 85,86,33.64 86,87,66.36 87,88,34.51 88,89,23.6 89,90,45.05 90,91,21.14 91,92,97.27 92,93,31.21 93,94,13.04 94,95,46.04 95,96,7.15 96,97,47.87 97,98,27.7 98,99,31.93 99,100,79.62 square: index,id,room_name,square 0,1,5,58.9 1,2,3,85.25 2,3,5,90.39 3,4,3,17.33 4,5,2,59.08 5,6,4,61.36 6,7,2,29.02 7,8,2,59.63 8,9,6,98.31 9,10,4,25.1 10,11,3,69.94 11,12,7,90.64 12,13,4,5.72 13,14,6,29.96 14,15,4,59.06 15,16,1,41.85 16,17,7,84.25 17,18,4,76.9 18,19,1,17.2 19,20,4,60.9 20,21,1,8.01 21,22,2,28.57 22,23,1,65.07 23,24,1,20.24 24,25,7,37.96 25,26,7,34.43 26,27,3,12.96 27,28,6,80.96 28,29,5,87.77 29,30,2,95.67 30,31,1,10.4 31,32,1,30.96 32,33,6,40.99 33,34,7,20.56 34,35,5,11.06 35,36,4,46.62 36,37,3,51.75 37,38,4,93.94 38,39,5,62.64 39,40,6,29.28 40,41,3,23.52 41,42,6,32.53 42,43,1,33.3 43,44,3,99.53 44,45,5,29.76 45,46,7,77.09 46,47,1,71.31 47,48,2,59.22 48,49,1,65.18 49,50,7,81.98 
50,51,7,26.5 51,52,3,73.8 52,53,6,78.52 53,54,6,69.67 54,55,6,73.91 55,56,6,4.36 56,57,5,26.95 57,58,2,23.8 58,59,2,31.89 59,60,1,8.98 60,61,1,88.76 61,62,5,88.75 62,63,4,44.94 63,64,4,81.13 64,65,5,48.39 65,66,3,55.63 66,67,7,46.28 67,68,3,40.85 68,69,7,54.37 69,70,3,14.01 70,71,6,20.13 71,72,2,90.67 72,73,3,4.28 73,74,4,56.18 74,75,3,74.8 75,76,5,10.34 76,77,6,15.94 77,78,2,29.4 78,79,6,60.8 79,80,3,13.05 80,81,3,49.46 81,82,1,75.76 82,83,1,84.27 83,84,5,76.36 84,85,3,75.98 85,86,7,77.81 86,87,2,56.34 87,88,1,43.93 88,89,5,30.64 89,90,5,55.78 90,91,5,88.26 91,92,6,15.11 92,93,1,20.64 93,94,2,5.08 94,95,1,82.31 95,96,4,76.92 96,97,1,53.47 97,98,2,2.7 98,99,7,77.12 99,100,4,29.43 floor: index,id,room_name 0,1,6 1,2,7 2,3,12 3,4,3 4,5,7 5,6,4 6,7,8 7,8,11 8,9,10 9,10,7 10,11,5 11,12,4 12,13,1 13,14,11 14,15,12 15,16,9 16,17,3 17,18,3 18,19,9 19,20,12 20,21,3 21,22,7 22,23,8 23,24,12 24,25,3 25,26,6 26,27,6 27,28,10 28,29,4 29,30,10 30,31,9 31,32,11 32,33,2 33,34,2 34,35,5 35,36,4 36,37,2 37,38,6 38,39,11 39,40,5 40,41,6 41,42,3 42,43,11 43,44,2 44,45,4 45,46,2 46,47,9 47,48,12 48,49,5 49,50,4 50,51,2 51,52,3 52,53,9 53,54,10 54,55,6 55,56,5 56,57,5 57,58,4 58,59,5 59,60,5 60,61,12 61,62,7 62,63,12 63,64,7 64,65,11 65,66,7 66,67,12 67,68,3 68,69,8 69,70,11 70,71,12 71,72,8 72,73,12 73,74,5 74,75,11 75,76,4 76,77,3 77,78,4 78,79,5 79,80,12 80,81,5 81,82,12 82,83,4 83,84,8 84,85,2 85,86,8 86,87,8 87,88,9 88,89,4 89,90,1 90,91,2 91,92,9 92,93,2 93,94,12 94,95,1 95,96,4 96,97,8 97,98,7 98,99,7 99,100,5
IIUC, you are overcomplicating things. The whole point of merging on id is that you don't need to filter the other dataframes on id beforehand with loc and isin as you tried to do; merge does that for you. You can multiply square and width in square_df (matc_df would also work, since they have the same length and ids), then merge this new column into volume_df (which restricts the multiplied result to the ids found in volume_df) and multiply it again.

square_df['square*width'] = square_df['square'] * matc_df['width']

df = volume_df.merge(square_df[['id', 'square*width']], on='id', how='left')
df['result'] = df['coefLoosening'] * df['square*width']

Output df:

id room_name coefLoosening square*width result 0 1 6 1.020408 941.2220 960.430612 1 2 7 1.515152 3079.2300 4665.500000 2 4 3 2.000000 557.8527 1115.705400 3 5 7 4.347826 1095.3432 4762.361739 4 6 4 5.263158 6072.1856 31958.871579 5 10 7 9.090909 1620.4560 14731.418182 6 11 5 1.162791 4070.5080 4733.148837 7 12 4 1.149425 4992.4512 5738.449655 8 13 1 1.851852 21.9648 40.675556 9 17 3 1.098901 6513.3675 7157.546703 10 18 3 1.492537 1172.7250 1750.335821 11 21 3 2.083333 506.3121 1054.816875 12 22 7 1.428571 2180.4624 3114.946286 13 25 3 1.010101 399.3392 403.372929 14 26 6 1.562500 1881.5995 2939.999219 15 27 6 3.448276 1237.1616 4266.074483 16 29 4 1.612903 6992.6359 11278.445000 17 33 2 1.030928 2336.8399 2409.113299 18 34 2 33.333333 566.2224 18874.080000 19 35 5 1.000000 81.4016 81.401600 20 36 4 1.123596 1698.8328 1908.800899 21 37 2 1.960784 1223.3700 2398.764706 22 38 6 2.127660 7419.3812 15785.917447 23 40 5 2.857143 2699.3232 7712.352000 24 41 6 1.369863 735.2352 1007.171507 25 42 3 1.111111 2007.4263 2230.473667 26 44 2 1.694915 6974.0671 11820.452712 27 45 4 1.492537 324.6816 484.599403 28 46 2 1.190476 326.8616 389.120952 29 49 5 1.818182 479.0730 871.041818 30 50 4 1.612903 3828.4660 6174.945161 31 51 2 12.500000 2588.7850 32359.812500 32 52 3 1.052632 2363.8140 2488.225263 33 55 6 3.846154 997.7850 3837.634615 34 56 5 2.040816 184.4280 376.383673 35 57 5 1.098901 2552.4345 2804.873077 36 58 4 2.941176 892.2620 2624.300000 37 59 5 2.941176 1845.1554 5426.927647 38 60 5 2.857143 451.6042 1290.297714 39 62 7 1.111111 1613.4750 1792.750000 40 64 7 1.333333 7160.5338 9547.378400 41 66 7 1.315789 238.0964 313.284737 42 68 3 3.703704 1180.1565 4370.950000 43 74 5 3.703704 227.5290 842.700000 44 76 4 2.000000 231.3058 462.611600 45 77 3 33.333333 832.0680 27735.600000 46 78 4 12.500000 2889.7260 36121.575000 47 79 5 1.149425 4437.1840 5100.211494 48 81 5 1.724138 300.2222 517.624483 49 83 4 4.166667 3016.8660 12570.275000 50 85 2 1.010101 4874.8768 4924.117980 51 89 4 1.041667 723.1040 753.233333 52 90 1 1.162791 2512.8890 2921.963953 53 91 2 3.225806 1865.8164 6018.762581 54 93 2 1.369863 644.1744 882.430685 55 95 1 1.666667 3789.5524 6315.920667 56 96 4 4.545455 549.9780 2499.900000 57 98 7 1.123596 74.7900 84.033708 58 99 7 1.351351 2462.4416 3327.623784 59 100 5 2.173913 2343.2166 5093.949130
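For what it's worth, the same id alignment can also be done without an explicit merge, by letting pandas align the series on an id index. A sketch, under the same column-name assumptions as above:

#multiplication aligned by id, then looked up per row with map
sw = square_df.set_index('id')['square'] * matc_df.set_index('id')['width']
volume_df['result'] = volume_df['coefLoosening'] * volume_df['id'].map(sw)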
Compute annual rate using a DataFrame and pct_change()
I have a column inside a DataFrame that I want to use in order to perform the following operation:

step = 3
n = step / 12
t1 = step - 1
pd.DataFrame(100*((df[t1+step::step]['Column'].values / df[t1:-t1:step]['Column'].values)**(1/n) - 1))

A possible set of values for the column of interest could be:

>>> df['Column']
0 NaN 1 NaN 2 7469.5 3 NaN 4 NaN 5 7537.9 6 NaN 7 NaN 8 7655.2 9 NaN 10 NaN 11 7712.6 12 NaN 13 NaN 14 7784.1 15 NaN 16 NaN 17 7819.8 18 NaN 19 NaN 20 7898.6 21 NaN 22 NaN 23 7939.5 24 NaN 25 NaN 26 7995.0 27 NaN 28 NaN 29 8084.7 ...

So df[t1+step::step]['Column'] would give us:

>>> df[5::3]['Column']
5 7537.9 8 7655.2 11 7712.6 14 7784.1 17 7819.8 20 7898.6 23 7939.5 26 7995.0 29 8084.7 32 8158.0 35 8292.7 38 8339.3 41 8449.5 44 8498.3 47 8610.9 50 8697.7 53 8766.1 56 8831.5 59 8850.2 62 8947.1 65 8981.7 68 8983.9 71 8907.4 74 8865.6 77 8934.4 80 8977.3 83 9016.4 86 9123.0 89 9223.5 92 9313.2 ...

And lastly df[t1:-t1:step]['Column']:

>>> df[2:-2:3]['Column']
2 7469.5 5 7537.9 8 7655.2 11 7712.6 14 7784.1 17 7819.8 20 7898.6 23 7939.5 26 7995.0 29 8084.7 32 8158.0 35 8292.7 38 8339.3 41 8449.5 44 8498.3 47 8610.9 50 8697.7 53 8766.1 56 8831.5 59 8850.2 62 8947.1 65 8981.7 68 8983.9 71 8907.4 74 8865.6 77 8934.4 80 8977.3 83 9016.4 86 9123.0 89 9223.5 ...

With these values, the expected output is the following:

>>> pd.DataFrame(100*((df[5::3]['Column'].values / df[2:-2:3]['Column'].values)**4 - 1))
0 3.713517 1 6.371352 2 3.033171 3 3.760103 4 1.847168 5 4.092131 6 2.087397 7 2.825602 8 4.563898 9 3.676223 10 6.769944 11 2.266778 12 5.391516 13 2.330287 14 5.406150 15 4.093476 16 3.182961 17 3.017786 18 0.849662 19 4.452016 20 1.555866 21 0.098013 22 -3.362834 23 -1.863919 24 3.140454 25 1.934544 26 1.753587 27 4.813692 28 4.479794 29 3.947179

Since this reminds me a lot of what pct_change() does, I was wondering if I could achieve the same result by doing something like:

>>> df['Column'].pct_change(periods=step)**(1/n) * 100

So far I have been getting incorrect outputs, though. Is it possible to use pct_change() and achieve the same result?
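A side note on the attempt above (an editorial sketch, not from the original thread): pct_change() returns the growth ratio minus one, so the one has to be added back before raising to the power 1/n; exponentiating the rate itself gives wrong numbers. Working on the NaN-free series also sidesteps pct_change's default forward-filling:

step = 3
n = step / 12                 # so 1/n == 4
s = df['Column'].dropna()     # keep the one real value per step
#(1 + pct_change) restores the ratio v_t / v_{t-step}; annualize and rescale
annual = 100 * ((1 + s.pct_change()) ** (1 / n) - 1)
print(annual.dropna().reset_index(drop=True))

With the sample column this reproduces the expected values above (3.713517, 6.371352, ...).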
Choosing the larger probability from a specific indexID
I have a dataframe as follows:

    indexID  matchID  order userClean  Probability
0         0        1      0     clean           35
1         0        2      1     clean           75
2         0        2      2     clean           25
5         3        4      5     clean           40
6         3        5      6     clean           85
9         4        5      9     clean           74
12        6        7     12     clean           23
13        6        8     13     clean           72
14        7        8     14     clean           85
15        9       10     15     clean           76
16       10       11     16     clean           91
19       13       14     19     clean           27
23       13       17     23     clean           10
28       13       18     28     clean           71
32       20       21     32     clean           97
33       20       22     33     clean           30

What I want to do is, for each repeated indexID, choose the entry with the higher Probability, mark it as clean, and mark the others as dirty. The output should look something like this:

    indexID  matchID  order userClean  Probability
0         0        1      0     dirty           35
1         0        2      1     clean           75
2         0        2      2     dirty           25
5         3        4      5     dirty           40
6         3        5      6     clean           85
9         4        5      9     clean           74
12        6        7     12     dirty           23
13        6        8     13     clean           72
14        7        8     14     clean           85
15        9       10     15     clean           76
16       10       11     16     clean           91
19       13       14     19     dirty           27
23       13       17     23     dirty           10
28       13       18     28     clean           71
32       20       21     32     clean           97
33       20       22     33     dirty           30
If you need a pandas solution, create a boolean mask by comparing the Probability column via Series.ne (!=) against the per-group maximum from transform — transform is used because it returns a Series the same size as df:

mask = df['Probability'].ne(df.groupby('indexID')['Probability'].transform('max'))
df.loc[mask, 'userClean'] = 'dirty'
print (df)

    indexID  matchID  order userClean  Probability
0         0        1      0     dirty           35
1         0        2      1     clean           75
2         0        2      2     dirty           25
5         3        4      5     dirty           40
6         3        5      6     clean           85
9         4        5      9     clean           74
12        6        7     12     dirty           23
13        6        8     13     clean           72
14        7        8     14     clean           85
15        9       10     15     clean           76
16       10       11     16     clean           91
19       13       14     19     dirty           27
23       13       17     23     dirty           10
28       13       18     28     clean           71
32       20       21     32     clean           97
33       20       22     33     dirty           30

Detail:

print (df.groupby('indexID')['Probability'].transform('max'))
0     75
1     75
2     75
5     85
6     85
9     74
12    72
13    72
14    85
15    76
16    91
19    71
23    71
28    71
32    97
33    97
Name: Probability, dtype: int64

If you instead want to compare against the overall mean with gt (>):

mask = df['Probability'].gt(df['Probability'].mean())
df.loc[mask, 'userClean'] = 'dirty'
print (df)

    indexID  matchID  order userClean  Probability
0         0        1      0     clean           35
1         0        2      1     dirty           75
2         0        2      2     clean           25
5         3        4      5     clean           40
6         3        5      6     dirty           85
9         4        5      9     dirty           74
12        6        7     12     clean           23
13        6        8     13     dirty           72
14        7        8     14     dirty           85
15        9       10     15     dirty           76
16       10       11     16     dirty           91
19       13       14     19     clean           27
23       13       17     23     clean           10
28       13       18     28     dirty           71
32       20       21     32     dirty           97
33       20       22     33     clean           30
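For completeness, a variant built on idxmax (a sketch; note that idxmax keeps only the first row in case of ties, and that this rebuilds userClean for every row rather than only overwriting the dirty ones):

import numpy as np

#index labels of the per-group probability maxima
best = df.groupby('indexID')['Probability'].idxmax()
df['userClean'] = np.where(df.index.isin(best), 'clean', 'dirty')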
How to interpolate data and angles with PANDAS
I have a simple dataframe df that contains three columns:

Time: expressed in seconds
A: a set of values that can vary between -inf and +inf
B: a set of angles (degrees) that range between 0 and 359

Here is the dataframe:

df = pd.DataFrame({'Time':[0,12,23,25,44,50], 'A':[5,7,9,8,11,6], 'B':[300,358,4,10,2,350]})

And it looks like this:

   Time   A    B
0     0   5  300
1    12   7  358
2    23   9    4
3    25   8   10
4    44  11    2
5    50   6  350

My idea is to interpolate the data from 0 to 50 seconds, and I was able to achieve that with the following lines of code:

y = pd.DataFrame({'Time':list(range(df['Time'].iloc[0], df['Time'].iloc[-1]))})
df = pd.merge(left=y, right=df, on='Time', how='left').interpolate()

Problem: even though column A is interpolated correctly, column B is wrong, because the interpolation does not wrap around 360 degrees. Here is an example:

    Time         A           B
12    12  7.000000  358.000000
13    13  7.181818  325.818182
14    14  7.363636  293.636364
15    15  7.545455  261.454545
16    16  7.727273  229.272727
17    17  7.909091  197.090909
18    18  8.090909  164.909091
19    19  8.272727  132.727273
20    20  8.454545  100.545455
21    21  8.636364   68.363636
22    22  8.818182   36.181818
23    23  9.000000    4.000000

Question: can you suggest a smart and efficient way to solve this issue and interpolate the angles correctly across the 0/360 boundary?
You should be able to use the method described in this question for the angle column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Time':[0,12,23,25,44,50], 'A':[5,7,9,8,11,6], 'B':[300,358,4,10,2,350]})
df['B'] = np.rad2deg(np.unwrap(np.deg2rad(df['B'])))
y = pd.DataFrame({'Time':list(range(df['Time'].iloc[0], df['Time'].iloc[-1]))})
df = pd.merge(left=y, right=df, on='Time', how='left').interpolate()
df['B'] %= 360
print(df)

Output:

    Time          A           B
0      0   5.000000  300.000000
1      1   5.166667  304.833333
2      2   5.333333  309.666667
3      3   5.500000  314.500000
4      4   5.666667  319.333333
5      5   5.833333  324.166667
6      6   6.000000  329.000000
7      7   6.166667  333.833333
8      8   6.333333  338.666667
9      9   6.500000  343.500000
10    10   6.666667  348.333333
11    11   6.833333  353.166667
12    12   7.000000  358.000000
13    13   7.181818  358.545455
14    14   7.363636  359.090909
15    15   7.545455  359.636364
16    16   7.727273    0.181818
17    17   7.909091    0.727273
18    18   8.090909    1.272727
19    19   8.272727    1.818182
20    20   8.454545    2.363636
21    21   8.636364    2.909091
22    22   8.818182    3.454545
23    23   9.000000    4.000000
24    24   8.500000    7.000000
25    25   8.000000   10.000000
26    26   8.157895    9.578947
27    27   8.315789    9.157895
28    28   8.473684    8.736842
29    29   8.631579    8.315789
30    30   8.789474    7.894737
31    31   8.947368    7.473684
32    32   9.105263    7.052632
33    33   9.263158    6.631579
34    34   9.421053    6.210526
35    35   9.578947    5.789474
36    36   9.736842    5.368421
37    37   9.894737    4.947368
38    38  10.052632    4.526316
39    39  10.210526    4.105263
40    40  10.368421    3.684211
41    41  10.526316    3.263158
42    42  10.684211    2.842105
43    43  10.842105    2.421053
44    44  11.000000    2.000000
45    45  11.000000    2.000000
46    46  11.000000    2.000000
47    47  11.000000    2.000000
48    48  11.000000    2.000000
49    49  11.000000    2.000000
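An alternative, in case you prefer not to unwrap first: interpolate the sine and cosine components of the angle and convert back with arctan2. Note this is only a sketch and it interpolates along the chord rather than at constant angular speed, so intermediate values can differ slightly from the unwrap approach:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Time':[0,12,23,25,44,50], 'A':[5,7,9,8,11,6], 'B':[300,358,4,10,2,350]})
y = pd.DataFrame({'Time':list(range(df['Time'].iloc[0], df['Time'].iloc[-1]))})
m = pd.merge(left=y, right=df, on='Time', how='left')
m['A'] = m['A'].interpolate()
rad = np.deg2rad(m['B'])
#interpolate the unit-vector components, then recover the angle in [0, 360)
m['B'] = np.rad2deg(np.arctan2(np.sin(rad).interpolate(),
                               np.cos(rad).interpolate())) % 360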
Columns located within a column
I am trying to extract a dataframe from a web API and can't work out how to break the nested columns out. Home and Away have breakdowns inside them, so the columns should read Home Wins, Home Draws, etc.

url = "http://api.football-data.org/v1/soccerseasons/398/leagueTable/?matchday=38"
response = requests.get(url)
response_json = response.content
result = json.loads(response_json)
football = pd.DataFrame(result['standing'], columns=['position','teamName','playedGames','wins','draws','losses','goals', 'goalsAgainst','home','away','goalDifference','points'])
football
football.home

This shows the problem:

0 {u'wins': 12, u'losses': 1, u'draws': 6, u'goa...
I think you can use json_normalize:

import pandas as pd
import json
import requests
from pandas.io.json import json_normalize

url = "http://api.football-data.org/v1/soccerseasons/398/leagueTable/?matchday=38"
result = json.loads(requests.get(url).text)
#print (result)
df = json_normalize(result["standing"])
print (df)

                             _links.team.href  away.draws  away.goals  \
0   http://api.football-data.org/v1/teams/338           6          33
1    http://api.football-data.org/v1/teams/57           7          34
2    http://api.football-data.org/v1/teams/73           7          34
3    http://api.football-data.org/v1/teams/65           7          24
4    http://api.football-data.org/v1/teams/66           4          22
5   http://api.football-data.org/v1/teams/340           6          20
6   http://api.football-data.org/v1/teams/563           7          31
7    http://api.football-data.org/v1/teams/64           4          30
8    http://api.football-data.org/v1/teams/70           5          19
9    http://api.football-data.org/v1/teams/61           5          27
10   http://api.football-data.org/v1/teams/62           9          24
11   http://api.football-data.org/v1/teams/72           5          22
12  http://api.football-data.org/v1/teams/346           3          20
13   http://api.football-data.org/v1/teams/74           8          14
14  http://api.football-data.org/v1/teams/354           6          20
15  http://api.football-data.org/v1/teams/1044          4          22
16   http://api.football-data.org/v1/teams/71           6          25
17   http://api.football-data.org/v1/teams/67           3          12
18   http://api.football-data.org/v1/teams/68           2          13
19   http://api.football-data.org/v1/teams/58           3          13

    away.goalsAgainst  away.losses  away.wins  \
0                  18            2         11
1                  25            4          8
2                  20            3          9
3                  20            5          7
4                  26            8          7
5                  19            6          7
6                  25            5          7
7                  28            7          8
8                  31            8          6
9                  23            7          7
10                 25            5          5
11                 32           10          4
12                 31           10          6
13                 22            7          4
14                 28            8          5
15                 33            9          6
16                 42           10          3
17                 41           14          2
18                 37           14          3
19                 41           15          1

                                             crestURI  draws  goalDifference  \
0   http://upload.wikimedia.org/wikipedia/en/6/63/...     12              32
1   http://upload.wikimedia.org/wikipedia/en/5/53/...     11              29
2   http://upload.wikimedia.org/wikipedia/de/b/b4/...     13              34
3   http://upload.wikimedia.org/wikipedia/de/f/fd/...      9              30
4   http://upload.wikimedia.org/wikipedia/de/d/da/...      9              14
5   http://upload.wikimedia.org/wikipedia/de/c/c9/...      9              18
6   http://upload.wikimedia.org/wikipedia/de/e/e0/...     14              14
7   http://upload.wikimedia.org/wikipedia/de/0/0a/...     12              13
8   http://upload.wikimedia.org/wikipedia/de/a/a3/...      9             -14
9   http://upload.wikimedia.org/wikipedia/de/5/5c/...     14               6
10  http://upload.wikimedia.org/wikipedia/de/f/f9/...     14               4
11  http://upload.wikimedia.org/wikipedia/de/a/ab/...     11             -10
12  https://upload.wikimedia.org/wikipedia/en/e/e2...      9             -10
13  http://upload.wikimedia.org/wikipedia/de/8/8b/...     13             -14
14  http://upload.wikimedia.org/wikipedia/de/b/bf/...      9             -12
15  https://upload.wikimedia.org/wikipedia/de/4/41...      9             -22
16  http://upload.wikimedia.org/wikipedia/de/6/60/...     12             -14
17  http://upload.wikimedia.org/wikipedia/de/5/56/...     10             -21
18  http://upload.wikimedia.org/wikipedia/de/8/8c/...      7             -28
19  http://upload.wikimedia.org/wikipedia/de/9/9f/...      8             -49

    goals ...  home.goals  home.goalsAgainst  home.losses  home.wins  \
0      68 ...          35                 18            1         12
1      65 ...          31                 11            3         12
2      69 ...          35                 15            3         10
3      71 ...          47                 21            5         12
4      49 ...          27                  9            2         12
5      59 ...          39                 22            5         11
6      65 ...          34                 26            3          9
7      63 ...          33                 22            3          8
8      41 ...          22                 24            7          8
9      59 ...          32                 30            5          5
10     59 ...          35                 30            8          6
11     42 ...          20                 20            5          8
12     40 ...          20                 19            7          6
13     34 ...          20                 26            8          6
14     39 ...          19                 23           10          6
15     45 ...          23                 34            9          5
16     48 ...          23                 20            7          6
17     44 ...          32                 24            5          7
18     39 ...          26                 30            8          6
19     27 ...          14                 35           12          2

    losses  playedGames  points  position                 teamName  wins
0        3           38      81         1        Leicester City FC    23
1        7           38      71         2               Arsenal FC    20
2        6           38      70         3     Tottenham Hotspur FC    19
3       10           38      66         4       Manchester City FC    19
4       10           38      66         5     Manchester United FC    19
5       11           38      63         6           Southampton FC    18
6        8           38      62         7       West Ham United FC    16
7       10           38      60         8             Liverpool FC    16
8       15           38      51         9            Stoke City FC    14
9       12           38      50        10               Chelsea FC    12
10      13           38      47        11               Everton FC    11
11      15           38      47        12          Swansea City FC    12
12      17           38      45        13               Watford FC    12
13      15           38      43        14  West Bromwich Albion FC    10
14      18           38      42        15        Crystal Palace FC    11
15      18           38      42        16          AFC Bournemouth    11
16      17           38      39        17           Sunderland AFC     9
17      19           38      37        18      Newcastle United FC     9
18      22           38      34        19          Norwich City FC     9
19      27           38      17        20           Aston Villa FC     3

[20 rows x 22 columns]
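A small compatibility note: on newer pandas (1.0+), json_normalize is available as a top-level function and the pandas.io.json import is deprecated, so the same call becomes:

#pandas >= 1.0: json_normalize moved to the top-level namespace
df = pd.json_normalize(result["standing"])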