I have a pandas dataframe that looks like this:
X[m] Y[m] Z[m] ... beta newx newy
0 1.439485 0.087100 0.029771 ... 0.063807 1439 87
1 1.439485 0.089729 0.029121 ... 0.065871 1439 89
2 1.439485 0.091992 0.030059 ... 0.067653 1439 91
3 1.439485 0.082073 0.030721 ... 0.059883 1439 82
4 1.439485 0.084095 0.028952 ... 0.061458 1439 84
5 1.439485 0.085937 0.028019 ... 0.062897 1439 85
There are hundreds of thousands of such lines, and I have multiple dataframes like this. X and Y are coordinates on a plane (Z is not important) that has been rotated 45 degrees to the right about its middle point. I need to put every point back in its original place, i.e. rotate it -45 degrees about that point. The columns newx and newy hold the coordinates before the change, and I want to overwrite these two columns with the new coordinates. Since I know the coordinates of the middle point, the point itself, the angle from the middle to the point (alpha), and the angle from the middle to the fixed point (beta), I can use the approach presented in the linked Mathematics SO answer. I have translated the code to Python like this:
for i in range(len(df)):
    if df.iloc[i].alpha == math.pi/2 or df.iloc[i].alpha == 3*math.pi/2:
        df.newx[i] = mid
        df.newy[i] = int(math.tan(df.iloc[i].beta*(df.iloc[i].x-mid)+mid))
    elif df.iloc[i].beta == math.pi/2 or df.iloc[i].beta == 3*math.pi/2:
        #df.newx[i] = df.iloc[i].x -- this is already set
        df.newy[i] = int(math.tan(df.iloc[i].alpha*(mid-df.iloc[i].x)+mid))
    else:
        m0 = math.tan(df.iloc[i].alpha)
        m1 = math.tan(df.iloc[i].beta)
        x = ((m0 * df.iloc[i].x - m1 * mid) - (df.iloc[i].y - mid)) / (m0 - m1)
        df.newx[i] = int(x)
        df.newy[i] = int(m0 * (x - df.iloc[i].x) + df.iloc[i].y)
Although this does what I need and moves the points to their correct positions, it is extremely slow, and I have too many files to process them this way. I know there are much faster approaches, such as vectorization, apply, and list comprehensions, but I can't figure out how to use them with this function.
Here are the first 10 lines as a dictionary:
{'X[m]': {0: 1.439484727008419, 1: 1.439484727008419, 2: 1.439484727008419, 3: 1.439484727008419, 4: 1.439484727008419, 5: 1.439484727008419, 6: 1.439484727008419, 7: 1.439484727008419, 8: 1.439484727008419, 9: 1.439484727008419}, 'Y[m]': {0: 0.08709958190841899, 1: 0.08972904270841897, 2: 0.091991981408419, 3: 0.08207325440841898, 4: 0.08409548540841899, 5: 0.08593746080841899, 6: 0.09416210370841899, 7: 0.08874029660841898, 8: 0.09168940400841899, 9: 0.09434491760841898}, 'Z[m]': {0: 0.029770726299999998, 1: 0.0291213803, 2: 0.030058834700000002, 3: 0.0307212565, 4: 0.028951926200000002, 5: 0.0280194897, 6: 0.030717188500000003, 7: 0.026446931099999998, 8: 0.0269318204, 9: 0.0273838975}, 'Velocity[ms^-1]': {0: ['-1.67570162e+00', '-2.59946979e-15', '-2.54510192e-15'], 1: ['-1.63915336e+00', '-2.54277343e-15', '-2.48959140e-15'], 2: ['-1.69191790e+00', '-2.62462561e-15', '-2.56973173e-15'], 3: ['-1.72920227e+00', '-2.68246377e-15', '-2.62636012e-15'], 4: ['-1.62961555e+00', '-2.52797767e-15', '-2.47510523e-15'], 5: ['-1.57713342e+00', '-2.44656340e-15', '-2.39539372e-15'], 6: ['-1.72897375e+00', '-2.68210929e-15', '-2.62601305e-15'], 7: ['-1.48862195e+00', '-2.30925809e-15', '-2.26096006e-15'], 8: ['-1.51591396e+00', '-2.35159534e-15', '-2.30241195e-15'], 9: ['-1.54135919e+00', '-2.39106792e-15', '-2.34105888e-15']}, 'L': {0: 0.9582306809661671, 1: 0.9564957485824027, 2: 0.9550059224371557, 3: 0.9615583774318917, 4: 0.9602177760259737, 5: 0.9589987519260235, 6: 0.9535800607266656, 7: 0.9571476500665267, 8: 0.9552049510914844, 9: 0.953460072490227}, 'x': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'y': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}, 'alpha': {0: -0.7215912027987663, 1: -0.719527331916007, 2: -0.7177451479100487, 3: -0.7255156166536015, 4: -0.7239399868865558, 5: -0.7225009735356016, 6: -0.7160308360594005, 7: -0.7203042790640757, 8: -0.7179837655204843, 9: -0.7158861861473951}, 'beta': {0: 0.06380696059868196, 1: 0.06587083148144124, 2: 0.06765301548739955, 3: 0.05988254674384674, 4: 0.06145817651089247, 5: 0.06289718986184667, 6: 0.06936732733804774, 7: 0.0650938843333726, 8: 0.06741439787696402, 9: 0.0695119772500532}, 'newx': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'newy': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}}
I suspect that how you're using mid in your code may be causing you problems. Is mid numeric? Are the x- and y-coordinates of your middle point the same value?
@OP, please confirm that your variable names, as compared to your linked source, are as I have translated them:
linked name    your name
a0             beta
a1             alpha
(x0, y0)       (df.x, df.y)
(x1, y1)       (mid, mid)
Update: this answer shares some ideas with @mitoRibo's answer, but I re-translated from OP's linked source and suspect OP made some transcription errors (noted in comments below). Both of us used the strategy of "selectively calculate newx/newy using masking, where the masks are equivalent to the if/elif/else conditions provided".
#setup
import pandas as pd
import numpy as np
import math
df = pd.DataFrame({'X[m]': {0: 1.439484727008419, 1: 1.439484727008419, 2: 1.439484727008419, 3: 1.439484727008419, 4: 1.439484727008419, 5: 1.439484727008419, 6: 1.439484727008419, 7: 1.439484727008419, 8: 1.439484727008419, 9: 1.439484727008419}, 'Y[m]': {0: 0.08709958190841899, 1: 0.08972904270841897, 2: 0.091991981408419, 3: 0.08207325440841898, 4: 0.08409548540841899, 5: 0.08593746080841899, 6: 0.09416210370841899, 7: 0.08874029660841898, 8: 0.09168940400841899, 9: 0.09434491760841898}, 'Z[m]': {0: 0.029770726299999998, 1: 0.0291213803, 2: 0.030058834700000002, 3: 0.0307212565, 4: 0.028951926200000002, 5: 0.0280194897, 6: 0.030717188500000003, 7: 0.026446931099999998, 8: 0.0269318204, 9: 0.0273838975}, 'Velocity[ms^-1]': {0: ['-1.67570162e+00', '-2.59946979e-15', '-2.54510192e-15'], 1: ['-1.63915336e+00', '-2.54277343e-15', '-2.48959140e-15'], 2: ['-1.69191790e+00', '-2.62462561e-15', '-2.56973173e-15'], 3: ['-1.72920227e+00', '-2.68246377e-15', '-2.62636012e-15'], 4: ['-1.62961555e+00', '-2.52797767e-15', '-2.47510523e-15'], 5: ['-1.57713342e+00', '-2.44656340e-15', '-2.39539372e-15'], 6: ['-1.72897375e+00', '-2.68210929e-15', '-2.62601305e-15'], 7: ['-1.48862195e+00', '-2.30925809e-15', '-2.26096006e-15'], 8: ['-1.51591396e+00', '-2.35159534e-15', '-2.30241195e-15'], 9: ['-1.54135919e+00', '-2.39106792e-15', '-2.34105888e-15']}, 'L': {0: 0.9582306809661671, 1: 0.9564957485824027, 2: 0.9550059224371557, 3: 0.9615583774318917, 4: 0.9602177760259737, 5: 0.9589987519260235, 6: 0.9535800607266656, 7: 0.9571476500665267, 8: 0.9552049510914844, 9: 0.953460072490227}, 'x': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'y': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}, 'alpha': {0: -0.7215912027987663, 1: -0.719527331916007, 2: -0.7177451479100487, 3: -0.7255156166536015, 4: -0.7239399868865558, 5: -0.7225009735356016, 6: -0.7160308360594005, 7: -0.7203042790640757, 8: -0.7179837655204843, 9: -0.7158861861473951}, 'beta': {0: 0.06380696059868196, 1: 0.06587083148144124, 2: 0.06765301548739955, 3: 0.05988254674384674, 4: 0.06145817651089247, 5: 0.06289718986184667, 6: 0.06936732733804774, 7: 0.0650938843333726, 8: 0.06741439787696402, 9: 0.0695119772500532}})
# make the new columns
mid = 0  # placeholder: the question never gave the actual middle-point coordinate
df['newx'] = np.nan
df['newy'] = np.nan
# if any of the values are still np.nan when we're done, something went wrong
# Do the float `between` comparison but cleverly
EPSILON = 1e-6
# windows = ((pi/2 ± EPSILON), (3pi/2 ± EPSILON))
# note: `between(left, right)` needs left <= right, so order each window low-to-high
windows = tuple(tuple(d*math.pi/2 + s*EPSILON for s in (-1, 1)) for d in (1, 3))
# challenge: make this more DRY (don't repeat yourself)
alpha_cond = sum([df.alpha.between(*w) for w in windows]).astype(bool)
beta_cond = sum([ df.beta.between(*w) for w in windows]).astype(bool)\
& ~alpha_cond
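# One more-DRY possibility (my sketch, not part of the original answer):
# test both targets at once with np.isclose instead of two between() windows.
# targets = np.array([math.pi/2, 3*math.pi/2])
# def near_targets(s):
#     return pd.Series(np.isclose(s.to_numpy()[:, None], targets,
#                                 atol=EPSILON).any(axis=1), index=s.index)
# alpha_cond = near_targets(df.alpha)
# beta_cond = near_targets(df.beta) & ~alpha_cond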
neither = (~alpha_cond & ~beta_cond)
# Handle `alpha near pi/2 or 3pi/2`:
c1 = df.loc[alpha_cond]
df.loc[alpha_cond,'newx'] = mid
# changes vs OP's loop: moved the closing `tan` parenthesis,
# flipped `df.x - mid` to `mid - df.x`, and used `df.y` in place of `mid`
df.loc[alpha_cond,'newy'] = (np.tan(c1.beta) * (mid - c1.x) + c1.y).astype(int)
# Handle `beta near pi/2 or 3pi/2`:
c2 = df.loc[beta_cond]
df.loc[beta_cond,'newx'] = c2.x
# changes vs OP's loop: moved the closing `tan` parenthesis,
# and flipped `mid - df.x` to `df.x - mid`
df.loc[beta_cond,'newy'] = (np.tan(c2.alpha) * (c2.x - mid) + mid).astype(int)
# Handle the remainder:
c3 = df.loc[neither]
m0 = np.tan(c3.alpha)
m1 = np.tan(c3.beta)
t = ((m0 * c3.x - m1 * mid) - (c3.y - mid)) / (m0 - m1)
df.loc[neither,'newx'] = t.astype(int)
df.loc[neither,'newy'] = (m0 * (t - c3.x) + c3.y).astype(int)
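As a quick sanity check (my addition, not in the original answer), you can confirm that every row was handled by one of the three masks:
# every row should have been filled by one of the three cases above
assert not df[['newx', 'newy']].isna().any().any(), 'some rows matched no condition'
df[['x', 'y', 'newx', 'newy']].head()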
Same approach as @Joshua Voskamp's answer, but I still wanted to share:
import pandas as pd
import numpy as np
import math
df = pd.DataFrame({'X[m]': {0: 1.439484727008419, 1: 1.439484727008419, 2: 1.439484727008419, 3: 1.439484727008419, 4: 1.439484727008419, 5: 1.439484727008419, 6: 1.439484727008419, 7: 1.439484727008419, 8: 1.439484727008419, 9: 1.439484727008419}, 'Y[m]': {0: 0.08709958190841899, 1: 0.08972904270841897, 2: 0.091991981408419, 3: 0.08207325440841898, 4: 0.08409548540841899, 5: 0.08593746080841899, 6: 0.09416210370841899, 7: 0.08874029660841898, 8: 0.09168940400841899, 9: 0.09434491760841898}, 'Z[m]': {0: 0.029770726299999998, 1: 0.0291213803, 2: 0.030058834700000002, 3: 0.0307212565, 4: 0.028951926200000002, 5: 0.0280194897, 6: 0.030717188500000003, 7: 0.026446931099999998, 8: 0.0269318204, 9: 0.0273838975}, 'Velocity[ms^-1]': {0: ['-1.67570162e+00', '-2.59946979e-15', '-2.54510192e-15'], 1: ['-1.63915336e+00', '-2.54277343e-15', '-2.48959140e-15'], 2: ['-1.69191790e+00', '-2.62462561e-15', '-2.56973173e-15'], 3: ['-1.72920227e+00', '-2.68246377e-15', '-2.62636012e-15'], 4: ['-1.62961555e+00', '-2.52797767e-15', '-2.47510523e-15'], 5: ['-1.57713342e+00', '-2.44656340e-15', '-2.39539372e-15'], 6: ['-1.72897375e+00', '-2.68210929e-15', '-2.62601305e-15'], 7: ['-1.48862195e+00', '-2.30925809e-15', '-2.26096006e-15'], 8: ['-1.51591396e+00', '-2.35159534e-15', '-2.30241195e-15'], 9: ['-1.54135919e+00', '-2.39106792e-15', '-2.34105888e-15']}, 'L': {0: 0.9582306809661671, 1: 0.9564957485824027, 2: 0.9550059224371557, 3: 0.9615583774318917, 4: 0.9602177760259737, 5: 0.9589987519260235, 6: 0.9535800607266656, 7: 0.9571476500665267, 8: 0.9552049510914844, 9: 0.953460072490227}, 'x': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'y': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}, 'alpha': {0: -0.7215912027987663, 1: -0.719527331916007, 2: -0.7177451479100487, 3: -0.7255156166536015, 4: -0.7239399868865558, 5: -0.7225009735356016, 6: -0.7160308360594005, 7: -0.7203042790640757, 8: -0.7179837655204843, 9: -0.7158861861473951}, 'beta': {0: 0.06380696059868196, 1: 0.06587083148144124, 2: 0.06765301548739955, 3: 0.05988254674384674, 4: 0.06145817651089247, 5: 0.06289718986184667, 6: 0.06936732733804774, 7: 0.0650938843333726, 8: 0.06741439787696402, 9: 0.0695119772500532}, 'newx': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'newy': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}})
mid = 0  # not sure what mid value should be
near_threshold = 0.001
alpha_near_half_pi = df.alpha.sub(math.pi/2).abs().le(near_threshold)
alpha_near_three_half_pi = df.alpha.sub(3*math.pi/2).abs().le(near_threshold)
beta_near_half_pi = df.beta.sub(math.pi/2).abs().le(near_threshold)
beta_near_three_half_pi = df.beta.sub(3*math.pi/2).abs().le(near_threshold)
cond1 = alpha_near_half_pi | alpha_near_three_half_pi
cond2 = beta_near_half_pi | beta_near_three_half_pi
cond2 = cond2 & (~cond1) #if cond1 is true, we don't want to do cond2
cond3 = ~(cond1 | cond2) #if neither cond1 nor cond2, then we are in cond3
#Process cond1 rows
c1 = df.loc[cond1]
df.loc[cond1,'newx'] = mid
df.loc[cond1,'newy'] = np.tan(c1.beta*(c1.x-mid)+mid)
#Process cond2 rows
c2 = df.loc[cond2]
df.loc[cond2,'newy'] = np.tan(c2.alpha*(mid-c2.x)+mid)
#Process cond3 rows
c3 = df.loc[cond3]
m0 = np.tan(c3.alpha)
m1 = np.tan(c3.beta)
# Is this a mistake? `(c3.y - c3.y)` is always 0 (OP's loop had `df.y - mid` here)
x = ((m0 * c3.x - m1 * mid) - (c3.y - c3.y)) / (m0 - m1)
df.loc[cond3,'newx'] = x.astype(int)
df.loc[cond3,'newy'] = (m0 * (x - c3.x) + c3.y).astype(int)
df
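Either vectorized answer should be orders of magnitude faster than the original row-by-row .iloc loop, since each mask touches whole columns at once. If you want to confirm the speedup on your own files, here is a rough timing sketch (my own helper, not from either answer; wrap each approach in a function first):
import time

def benchmark(fn, df, repeats=3):
    # best-of-N wall clock; fn gets a fresh copy so writes don't accumulate
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn(df.copy())
        best = min(best, time.perf_counter() - start)
    return best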
So here is an image of what I have and what I want to get: https://imgur.com/a/RyDbvZD
Basically, those are SUMIF formulas in Excel, and I would like to recreate that in Python. I was trying with pandas' groupby().sum() function, but I have no clue how to group by two header rows like this, and then how to order the data.
Original dataframe:
df = pd.DataFrame( {'Group': {0: 'Name', 1: 20201001, 2: 20201002, 3: 20201003, 4: 20201004, 5: 20201005, 6: 20201006, 7: 20201007, 8: 20201008, 9: 20201009, 10: 20201010}, 'Credit': {0: 'Credit', 1: 65, 2: 69, 3: 92, 4: 18, 5: 58, 6: 12, 7: 31, 8: 29, 9: 12, 10: 41}, 'Equity': {0: 'Stock', 1: 92, 2: 62, 3: 54, 4: 52, 5: 14, 6: 5, 7: 14, 8: 17, 9: 54, 10: 51}, 'Equity.1': {0: 'Option', 1: 87, 2: 30, 3: 40, 4: 24, 5: 95, 6: 77, 7: 44, 8: 77, 9: 88, 10: 85}, 'Credit.1': {0: 'Credit', 1: 62, 2: 60, 3: 91, 4: 57, 5: 65, 6: 50, 7: 75, 8: 55, 9: 48, 10: 99}, 'Equity.2': {0: 'Option', 1: 61, 2: 91, 3: 38, 4: 3, 5: 71, 6: 51, 7: 74, 8: 41, 9: 59, 10: 31}, 'Bond': {0: 'Bond', 1: 4, 2: 62, 3: 91, 4: 66, 5: 30, 6: 51, 7: 76, 8: 6, 9: 65, 10: 73}, 'Unnamed: 7': {0: 'Stock', 1: 54, 2: 23, 3: 74, 4: 92, 5: 36, 6: 89, 7: 88, 8: 32, 9: 19, 10: 91}, 'Bond.1': {0: 'Bond', 1: 96, 2: 10, 3: 11, 4: 7, 5: 28, 6: 82, 7: 13, 8: 46, 9: 70, 10: 46}, 'Bond.2': {0: 'Bond', 1: 25, 2: 53, 3: 96, 4: 70, 5: 52, 6: 9, 7: 98, 8: 9, 9: 48, 10: 58}, 'Unnamed: 10': {0: float('nan'), 1: 63.0, 2: 80.0, 3: 17.0, 4: 21.0, 5: 30.0, 6: 78.0, 7: 23.0, 8: 31.0, 9: 72.0, 10: 65.0}} )
What I want at the end:
df = pd.DataFrame( {'Group': {0: 20201001, 1: 20201002, 2: 20201003, 3: 20201004, 4: 20201005, 5: 20201006, 6: 20201007, 7: 20201008, 8: 20201009, 9: 20201010}, 'Credit': {0: 127, 1: 129, 2: 183, 3: 75, 4: 123, 5: 62, 6: 106, 7: 84, 8: 60, 9: 140}, 'Equity': {0: 240, 1: 183, 2: 132, 3: 79, 4: 180, 5: 133, 6: 132, 7: 135, 8: 201, 9: 167}, 'Stock': {0: 146, 1: 85, 2: 128, 3: 144, 4: 50, 5: 94, 6: 102, 7: 49, 8: 73, 9: 142}, 'Option': {0: 148, 1: 121, 2: 78, 3: 27, 4: 166, 5: 128, 6: 118, 7: 118, 8: 147, 9: 116}} )
Any ideas on where to start with this, or anything else, would be appreciated.
Here you go. The first row seems to contain the real headers, so we first move it into the column names and set the index to Name:
df2 = df.rename(columns=df.loc[0]).drop(index=0).set_index(['Name'])
Then we group by the columns and sum:
df2.groupby(df2.columns, axis=1, sort=False).sum().reset_index()
and we get
Name Credit Stock Option Bond
0 20201001 127.0 146.0 148.0 125.0
1 20201002 129.0 85.0 121.0 125.0
2 20201003 183.0 128.0 78.0 198.0
3 20201004 75.0 144.0 27.0 143.0
4 20201005 123.0 50.0 166.0 110.0
5 20201006 62.0 94.0 128.0 142.0
6 20201007 106.0 102.0 118.0 187.0
7 20201008 84.0 49.0 118.0 61.0
8 20201009 60.0 73.0 147.0 183.0
9 20201010 140.0 142.0 116.0 177.0
I realise the output is not exactly what you asked for, but since we cannot see your SUMIF formulas, I do not know which columns you want to aggregate.
Edit
Following up on your comment: as far as I can tell, the rules for aggregation are somewhat messy, so the same input column is included in more than one output column (like Equity.1, which feeds both Equity and Option). I do not think there is much you can do with automation here; you can replicate your SUMIF experience by directly referencing the columns you want to add. So I think the following gives you what you want:
df = df.drop(index=0)
df2 = df[['Group']].copy()
df2['Credit'] = df['Credit'] + df['Credit.1']
df2['Equity'] = df['Equity'] + df['Equity.1']+ df['Equity.2']
df2['Stock'] = df['Equity'] + df['Unnamed: 7']
df2['Option'] = df['Equity.1'] + df['Equity.2']
df2
produces
Group Credit Equity Stock Option
-- -------- -------- -------- ------- --------
1 20201001 127 240 146 148
2 20201002 129 183 85 121
3 20201003 183 132 128 78
4 20201004 75 79 144 27
5 20201005 123 180 50 166
6 20201006 62 133 94 128
7 20201007 106 132 102 118
8 20201008 84 135 49 118
9 20201009 60 201 73 147
10 20201010 140 167 142 116
This also gives you control over which columns to include in the final output.
If you want this more automated, then you need to do something about the labels of your columns, as you would want a unique label for each set of columns you want to aggregate. If the same input column is used in more than one calculation, it is probably easiest to just duplicate it with the right labels, as in the sketch below.
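For instance, here is a minimal sketch of that idea; the mapping below is my reading of the SUMIF rules from your expected output, so adjust it to the real ones:
def sumif_by_labels(df, mapping):
    # mapping: {output label: input columns to sum}; a column may appear
    # in several lists, mimicking overlapping SUMIF ranges
    out = df[['Group']].copy()
    for label, cols in mapping.items():
        out[label] = df[cols].sum(axis=1)
    return out

mapping = {
    'Credit': ['Credit', 'Credit.1'],
    'Equity': ['Equity', 'Equity.1', 'Equity.2'],
    'Stock':  ['Equity', 'Unnamed: 7'],
    'Option': ['Equity.1', 'Equity.2'],
}
df2 = sumif_by_labels(df, mapping)  # df with the header row already dropped, as above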
If I have df1
df1 = pd.DataFrame({'Col_Name': {0: 'A', 1: 'b', 2: 'c'}, 'X': {0: 12, 1: 23, 2: 223}, 'Z': {0: 42, 1: 33, 2: 28 }})
and df2
df2 = pd.DataFrame({'Col': {0: 'Y', 1: 'X', 2: 'Z'}, 'Low1': {0: 0, 1: 0, 2: 0}, 'High1': {0: 10, 1: 10, 2: 630}, 'Low2': {0: 10, 1: 10, 2: 630}, 'High2': {0: 50, 1: 50, 2: 3000}, 'Low3': {0: 50, 1: 50, 2: 3000}, 'High3': {0: 100, 1: 100, 2: 8500}, 'Low4': {0: 100, 1: 100, 2: 8500}, 'High4': {0: 'np.inf', 1: 'np.inf', 2: 'np.inf'}})
Select the rows of df2 whose Col value is present among the column names of df1.
Expected Output: df3
df3 = pd.DataFrame({'Col': {0: 'X', 1: 'Z'}, 'Low1': {0: 0, 1: 0}, 'High1': {0: 10, 1: 630}, 'Low2': {0: 10, 1: 630}, 'High2': {0: 50, 1: 3000}, 'Low3': {0: 50, 1: 3000}, 'High3': {0: 100, 1: 8500}, 'Low4': {0: 100, 1: 8500}, 'High4': {0: 'np.inf', 1: 'np.inf'}})
How to do it?
You can pass a boolean list to select the rows of df2 that you want. This list can be created by looking at each value in the Col column and asking whether that value is in the columns of df1:
df3 = df2[[col in df1.columns for col in df2['Col']]]
You can drop the non-relevant column and use the other columns:
df3 = df2[df2['Col'].isin(list(df1.drop('Col_Name',axis=1).columns))]
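Since df1.columns is already an Index, the same filter can be written a little more directly; dropping Col_Name is only needed if a value in Col could collide with it:
df3 = df2[df2['Col'].isin(df1.columns)]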
Let's say my dataframe looks like this.
date app_id country val1 val2 val3 val4
2016-01-01 123 US 50 70 80 90
2016-01-02 123 US 60 80 90 100
2016-01-03 123 US 70 88 99 11
I want to dump this into a nested dictionary or even a JSON object as follows:
{
country:
{
app_id:
{
date: [val1, val2, val3, val4]
}
}
}
So that way, if I called my_dict['US'][123]['2016-01-01'], I would get the list [50, 70, 80, 90].
Is there an elegant way to go about doing this? I'm aware of Pandas's to_dict() function but I can't seem to get around nesting dictionaries.
First create the dataframe you need, then use recur_dictify from DSM:
dd = df.groupby(['country', 'app_id', 'date'], as_index=False)[['val1', 'val2', 'val3', 'val4']].apply(lambda x: x.values.tolist()[0]).to_frame()
def recur_dictify(frame):
    # base case: one column left -> return the scalar or the row of values
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    # recursive case: group on the first column, dictify the remaining columns
    grouped = frame.groupby(frame.columns[0])
    d = {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}
    return d
recur_dictify(dd.reset_index())
Out[711]:
{'US': {123: {'2016-01-01': [50, 70, 80, 90],
'2016-01-02': [60, 80, 90, 100],
'2016-01-03': [70, 88, 99, 11]}}}
Update
Actually this might work with a simple nested dictionary:
import pandas as pd
from collections import defaultdict
nested_dict = lambda: defaultdict(nested_dict)
output = nested_dict()
for lst in df.values:
    output[lst[1]][lst[0]][lst[2]] = lst[3:].tolist()
Or:
output = defaultdict(dict)
for lst in df.values:
    try:
        output[lst[1]][lst[0]].update({lst[2]: lst[3:].tolist()})
    except KeyError:
        output[lst[1]][lst[0]] = {}
    finally:
        output[lst[1]][lst[0]].update({lst[2]: lst[3:].tolist()})
Or:
output = defaultdict(dict)
for lst in df.values:
    if output.get(lst[1], {}).get(lst[0]) is None:
        output[lst[1]][lst[0]] = {}
    output[lst[1]][lst[0]].update({lst[2]: lst[3:].tolist()})
output
Here is my old solution: we make use of df.groupby to group the dataframe by country and app_id. From there we collect the data (excluding country and app_id) and use defaultdict(dict) to add the data to the output dictionary in a nested way.
import pandas as pd
from collections import defaultdict
output = defaultdict(dict)
groups = ["country", "app_id"]
cols = [i for i in df.columns if i not in groups]
for i, subdf in df.groupby(groups):
    data = subdf[cols].set_index('date').to_dict("split")  # filter away unwanted cols
    d = dict(zip(data['index'], data['data']))
    output[i[0]][i[1]] = d  # assign country=level1, app_id=level2
output
which returns:
{'FR': {123: {'2016-01-01': [10, 20, 30, 40]}},
'US': {123: {'2016-01-01': [50, 70, 80, 90],
'2016-01-02': [60, 80, 90, 100],
'2016-01-03': [70, 88, 99, 11]},
124: {'2016-01-01': [10, 20, 30, 40]}}}
and output['US'][123]['2016-01-01'] return:
[50, 70, 80, 90]
given the input:
df = pd.DataFrame.from_dict({'app_id': {0: 123, 1: 123, 2: 123, 3: 123, 4: 124},
'country': {0: 'US', 1: 'US', 2: 'US', 3: 'FR', 4: 'US'},
'date': {0: '2016-01-01',
1: '2016-01-02',
2: '2016-01-03',
3: '2016-01-01',
4: '2016-01-01'},
'val1': {0: 50, 1: 60, 2: 70, 3: 10, 4: 10},
'val2': {0: 70, 1: 80, 2: 88, 3: 20, 4: 20},
'val3': {0: 80, 1: 90, 2: 99, 3: 30, 4: 30},
'val4': {0: 90, 1: 100, 2: 11, 3: 40, 4: 40}})
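Since the question also mentions JSON: the nested defaultdicts above serialize fine, but converting to plain dicts first gives a cleaner repr (a small helper of my own, not from the answers):
import json

def to_plain_dict(d):
    # recursively convert (default)dicts into plain dicts
    if isinstance(d, dict):
        return {k: to_plain_dict(v) for k, v in d.items()}
    return d

# note: json.dumps turns the integer app_id keys into strings
print(json.dumps(to_plain_dict(output), indent=2))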
I'm trying to obtain the maximum value from every dictionary in a default dictionary of default dictionaries using Python3.
Dictionary Set Up:
d = defaultdict(lambda: defaultdict(int))
My iterator runs through the dictionaries and the csv data I'm using just fine, but when I call max, it doesn't necessarily return the max every time.
Example output:
defaultdict(<class 'int'>, {0: 106, 2: 35, 3: 12})
max = (0, 106)
defaultdict(<class 'int'>, {0: 131, 1: 649, 2: 338, 3: 348, 4: 276, 5: 150, 6: 138, 7: 89, 8: 54, 9: 22, 10: 5, 11: 2})
max = (0, 131)
defaultdict(<class 'int'>, {0: 39, 1: 13, 2: 30, 3: 15, 4: 5, 5: 10, 6: 1, 8: 1})
max = (0, 39)
defaultdict(<class 'int'>, {0: 40, 1: 53, 2: 97, 3: 80, 4: 154, 5: 203, 6: 173, 7: 142, 8: 113, 9: 76, 10: 55, 11: 22, 12: 13, 13: 7})
max = (0, 40)
So sometimes it's right, but far from perfect.
My approach was informed by the answer to this question, but I adapted it to try and make it work for a nested default dictionary. Here's the code I'm using to find the max:
for sub_d in d:
    outer_dict = d[sub_d]
    print(max(outer_dict.items(), key=lambda x: outer_dict.get(x, 0)))
Any insight would be greatly appreciated. Thanks so much.
If you check the values in outer_dict.items(), they actually consist of (key, value) tuples, and since those tuples aren't keys in your dictionary, get returns 0 for every one of them, so max just returns the first item (the one at index 0).
max(a.keys(), key=lambda x: a.get(x, 0))
will get you the key of the max value, and you can then retrieve the value by looking it up in the dictionary.
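For example, with a dict like the second one above (a quick sketch):
a = {0: 131, 1: 649, 2: 338, 3: 348}
k = max(a.keys(), key=lambda x: a.get(x, 0))  # key whose value is largest
print(k, a[k])  # -> 1 649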
In
max(outer_dict.items(), key=lambda x: outer_dict.get(x, 0))
the outer_dict.items() call returns a view that produces the (key, value) tuples of the items in outer_dict. So the key function gets passed a (key, value) tuple as its x argument, and then tries to find that tuple as a key in outer_dict; of course that's not going to succeed, so the get call always returns 0.
Instead, we can use a key function that extracts the value from the tuple, e.g.:
nested = {
'a': {0: 106, 2: 35, 3: 12},
'b': {0: 131, 1: 649, 2: 338, 3: 348, 4: 276, 5: 150, 6: 138, 7: 89,
8: 54, 9: 22, 10: 5, 11: 2},
'c': {0: 39, 1: 13, 2: 30, 3: 15, 4: 5, 5: 10, 6: 1, 8: 1},
'd': {0: 40, 1: 53, 2: 97, 3: 80, 4: 154, 5: 203, 6: 173, 7: 142,
8: 113, 9: 76, 10: 55, 11: 22, 12: 13, 13: 7},
}
for k, subdict in nested.items():
    print(k, max(subdict.items(), key=lambda t: t[1]))
output
a (0, 106)
b (1, 649)
c (0, 39)
d (5, 203)
A more efficient alternative to that lambda is to use itemgetter. Here's a version that puts the maxima into a dictionary:
from operator import itemgetter
nested = {
'a': {0: 106, 2: 35, 3: 12},
'b': {0: 131, 1: 649, 2: 338, 3: 348, 4: 276, 5: 150, 6: 138, 7: 89,
8: 54, 9: 22, 10: 5, 11: 2},
'c': {0: 39, 1: 13, 2: 30, 3: 15, 4: 5, 5: 10, 6: 1, 8: 1},
'd': {0: 40, 1: 53, 2: 97, 3: 80, 4: 154, 5: 203, 6: 173, 7: 142,
8: 113, 9: 76, 10: 55, 11: 22, 12: 13, 13: 7},
}
ig1 = itemgetter(1)
maxes = {k: max(subdict.items(), key=ig1)
         for k, subdict in nested.items()}
print(maxes)
output
{'a': (0, 106), 'b': (1, 649), 'c': (0, 39), 'd': (5, 203)}
We define ig1 outside the dictionary comprehension so that we don't call itemgetter(1) on every iteration of the outer loop.
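Hoisting itemgetter(1) out of the comprehension is a micro-optimization; if you're curious whether it matters, here is a rough timeit sketch (my own, with made-up data):
import timeit
from operator import itemgetter

items = list({i: (i * 7) % 13 for i in range(1000)}.items())
hoisted = itemgetter(1)
t_hoisted = timeit.timeit(lambda: max(items, key=hoisted), number=1000)
t_inline = timeit.timeit(lambda: max(items, key=itemgetter(1)), number=1000)
print(f'hoisted: {t_hoisted:.4f}s  inline: {t_inline:.4f}s')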