DataFrame with column of strings to column of integer lists - Python

I have a dataframe where in one column, the data for each row is a string like this:
[[25570], [26000]]
I want each entry in the series to become a list of integers.
I.e.:
[25570, 26000]
 ^      ^
 int    int
So far I can get it to a list of strings, but retaining empty spaces:
s = s.str.replace("[","").str.replace("]","")
s = s.str.replace(" ","").str.split(",")
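A sketch that finishes this string approach (assuming a recent pandas, where regex=False keeps "[" from being parsed as a regular expression), converting the split strings to ints:
s = s.str.replace("[", "", regex=False).str.replace("]", "", regex=False)
s = s.str.replace(" ", "", regex=False).str.split(",")
s = s.apply(lambda parts: [int(p) for p in parts if p])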
Dict for DataFrame (the bare nan values below are numpy.nan, so run from numpy import nan first if you want to paste this directly):
f = {'chunk': {0: '[72]',
1: '[72, 68]',
2: '[72, 68, 65]',
3: '[72, 68, 65, 70]',
4: '[72, 68, 65, 70, 67]',
5: '[72, 68, 65, 70, 67, 74]',
6: '[68]',
7: '[68, 65]',
8: '[68, 65, 70]',
9: '[68, 65, 70, 67]'},
'chunk_completed': {0: '[25570]',
1: '[26000]',
2: '[26240]',
3: '[26530]',
4: '[26880]',
5: '[27150]',
6: '[26000]',
7: '[26240]',
8: '[26530]',
9: '[26880]'},
'chunk_id': {0: '72',
1: '72-68',
2: '72-68-65',
3: '72-68-65-70',
4: '72-68-65-70-67',
5: '72-68-65-70-67-74',
6: '68',
7: '68-65',
8: '68-65-70',
9: '68-65-70-67'},
'diffs_avg': {0: nan,
1: 430.0,
2: 335.0,
3: 320.0,
4: 327.5,
5: 316.0,
6: nan,
7: 240.0,
8: 265.0,
9: 293.3333333333333},
'sd': {0: nan,
1: nan,
2: 134.35028842544406,
3: 98.48857801796105,
4: 81.80260794538685,
5: 75.3657747256671,
6: nan,
7: nan,
8: 35.355339059327385,
9: 55.075705472861024},
'timecodes': {0: '[[25570]]',
1: '[[25570], [26000]]',
2: '[[25570], [26000], [26240]]',
3: '[[25570], [26000], [26240], [26530]]',
4: '[[25570], [26000], [26240], [26530], [26880]]',
5: '[[25570], [26000], [26240], [26530], [26880], [27150]]',
6: '[[26000]]',
7: '[[26000], [26240]]',
8: '[[26000], [26240], [26530]]',
9: '[[26000], [26240], [26530], [26880]]'}}

Try this (here s is your dict of timecode strings):
f = pd.DataFrame.from_dict(s, orient='index')
f.columns = ['timecodes']
f['timecodes'].apply(lambda x: [a[0] for a in eval(x) if a])
Output
Out[622]:
0 [25570]
1 [25570, 26000]
2 [25570, 26000, 26240]
3 [25570, 26000, 26240, 26530]
4 [25570, 26000, 26240, 26530, 26880]
5 [25570, 26000, 26240, 26530, 26880, 27150]
6 [26000]
7 [26000, 26240]
8 [26000, 26240, 26530]
9 [26000, 26240, 26530, 26880]
10 [26000, 26240, 26530, 26880, 27150]
11 [26240]
12 [26240, 26530]
13 [26240, 26530, 26880]
14 [26240, 26530, 26880, 27150]
15 [26530]
16 [26530, 26880]
17 [26530, 26880, 27150]
18 [26880]
19 [26880, 27150]
Name: 0, dtype: object
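Note that eval will execute arbitrary code, so if the strings come from an untrusted source, ast.literal_eval is a safer drop-in. A minimal sketch against the f dict posted above:
import ast
import pandas as pd
df = pd.DataFrame(f)  # f is the dict from the question
df['timecodes'].apply(lambda x: [a[0] for a in ast.literal_eval(x) if a])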

Efficient way to apply function with multiple operations on dataframe row

I have a pandas dataframe that looks like this:
X[m] Y[m] Z[m] ... beta newx newy
0 1.439485 0.087100 0.029771 ... 0.063807 1439 87
1 1.439485 0.089729 0.029121 ... 0.065871 1439 89
2 1.439485 0.091992 0.030059 ... 0.067653 1439 91
3 1.439485 0.082073 0.030721 ... 0.059883 1439 82
4 1.439485 0.084095 0.028952 ... 0.061458 1439 84
5 1.439485 0.085937 0.028019 ... 0.062897 1439 85
There are hundreds of thousands of such rows, and I have multiple dataframes like this. X and Y are coordinates on a plane (Z is not important) that has been rotated 45 degrees to the right about its middle point. I need to put all the points back in their original places, -45 degrees from their current locations. I have columns newx and newy that currently hold the coordinates before the change, and I want to overwrite these two columns with the corrected coordinates. Since I know the coordinates of the middle point, the point itself, the angle middle-to-point (alpha), and the angle middle-to-fixedpoint (beta), I can use the approach presented in the linked Mathematics Stack Exchange answer. I have translated that code to Python like this:
for i in range(len(df)):
    if df.iloc[i].alpha == math.pi/2 or df.iloc[i].alpha == 3*math.pi/2:
        df.newx[i] = mid
        df.newy[i] = int(math.tan(df.iloc[i].beta*(df.iloc[i].x-mid)+mid))
    elif df.iloc[i].beta == math.pi/2 or df.iloc[i].beta == 3*math.pi/2:
        #df.newx[i] = df.iloc[i].x -- this is already set
        df.newy[i] = int(math.tan(df.iloc[i].alpha*(mid-df.iloc[i].x)+mid))
    else:
        m0 = math.tan(df.iloc[i].alpha)
        m1 = math.tan(df.iloc[i].beta)
        x = ((m0 * df.iloc[i].x - m1 * mid) - (df.iloc[i].y - mid)) / (m0 - m1)
        df.newx[i] = int(x)
        df.newy[i] = int(m0 * (x - df.iloc[i].x) + df.iloc[i].y)
Although this does what I need and moves the points to their correct positions, the time complexity is enormous and I have too many files to process them this way. I know there are far faster methods, such as vectorization, apply, and list comprehensions; I just can't figure out how to use them with this function.
Here are first 10 lines as dictionary:
{'X[m]': {0: 1.439484727008419, 1: 1.439484727008419, 2: 1.439484727008419, 3: 1.439484727008419, 4: 1.439484727008419, 5: 1.439484727008419, 6: 1.439484727008419, 7: 1.439484727008419, 8: 1.439484727008419, 9: 1.439484727008419}, 'Y[m]': {0: 0.08709958190841899, 1: 0.08972904270841897, 2: 0.091991981408419, 3: 0.08207325440841898, 4: 0.08409548540841899, 5: 0.08593746080841899, 6: 0.09416210370841899, 7: 0.08874029660841898, 8: 0.09168940400841899, 9: 0.09434491760841898}, 'Z[m]': {0: 0.029770726299999998, 1: 0.0291213803, 2: 0.030058834700000002, 3: 0.0307212565, 4: 0.028951926200000002, 5: 0.0280194897, 6: 0.030717188500000003, 7: 0.026446931099999998, 8: 0.0269318204, 9: 0.0273838975}, 'Velocity[ms^-1]': {0: ['-1.67570162e+00', '-2.59946979e-15', '-2.54510192e-15'], 1: ['-1.63915336e+00', '-2.54277343e-15', '-2.48959140e-15'], 2: ['-1.69191790e+00', '-2.62462561e-15', '-2.56973173e-15'], 3: ['-1.72920227e+00', '-2.68246377e-15', '-2.62636012e-15'], 4: ['-1.62961555e+00', '-2.52797767e-15', '-2.47510523e-15'], 5: ['-1.57713342e+00', '-2.44656340e-15', '-2.39539372e-15'], 6: ['-1.72897375e+00', '-2.68210929e-15', '-2.62601305e-15'], 7: ['-1.48862195e+00', '-2.30925809e-15', '-2.26096006e-15'], 8: ['-1.51591396e+00', '-2.35159534e-15', '-2.30241195e-15'], 9: ['-1.54135919e+00', '-2.39106792e-15', '-2.34105888e-15']}, 'L': {0: 0.9582306809661671, 1: 0.9564957485824027, 2: 0.9550059224371557, 3: 0.9615583774318917, 4: 0.9602177760259737, 5: 0.9589987519260235, 6: 0.9535800607266656, 7: 0.9571476500665267, 8: 0.9552049510914844, 9: 0.953460072490227}, 'x': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'y': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}, 'alpha': {0: -0.7215912027987663, 1: -0.719527331916007, 2: -0.7177451479100487, 3: -0.7255156166536015, 4: -0.7239399868865558, 5: -0.7225009735356016, 6: -0.7160308360594005, 7: -0.7203042790640757, 8: -0.7179837655204843, 9: -0.7158861861473951}, 'beta': {0: 0.06380696059868196, 1: 0.06587083148144124, 2: 0.06765301548739955, 3: 0.05988254674384674, 4: 0.06145817651089247, 5: 0.06289718986184667, 6: 0.06936732733804774, 7: 0.0650938843333726, 8: 0.06741439787696402, 9: 0.0695119772500532}, 'newx': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'newy': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}}
I suspect the way mid is used in your code may be causing you problems. Is mid numeric? Are the x- and y-coordinates of your middle point the same value?
@OP, please confirm that your variable names, compared to your linked source, are as I have translated them:
linked name    your name
a0             beta
a1             alpha
(x0, y0)       (df.x, df.y)
(x1, y1)       (mid, mid)
Update: this answer shares some ideas with @mitoRibo's answer, but I re-translated from OP's linked source and suspect OP made a transcription error (noted in comments). Both of us used the strategy of "selectively calculate newx/newy using masking, where the masks are equivalent to the if/elif/else conditions provided".
#setup
import pandas as pd
import numpy as np
import math
df = pd.DataFrame({'X[m]': {0: 1.439484727008419, 1: 1.439484727008419, 2: 1.439484727008419, 3: 1.439484727008419, 4: 1.439484727008419, 5: 1.439484727008419, 6: 1.439484727008419, 7: 1.439484727008419, 8: 1.439484727008419, 9: 1.439484727008419}, 'Y[m]': {0: 0.08709958190841899, 1: 0.08972904270841897, 2: 0.091991981408419, 3: 0.08207325440841898, 4: 0.08409548540841899, 5: 0.08593746080841899, 6: 0.09416210370841899, 7: 0.08874029660841898, 8: 0.09168940400841899, 9: 0.09434491760841898}, 'Z[m]': {0: 0.029770726299999998, 1: 0.0291213803, 2: 0.030058834700000002, 3: 0.0307212565, 4: 0.028951926200000002, 5: 0.0280194897, 6: 0.030717188500000003, 7: 0.026446931099999998, 8: 0.0269318204, 9: 0.0273838975}, 'Velocity[ms^-1]': {0: ['-1.67570162e+00', '-2.59946979e-15', '-2.54510192e-15'], 1: ['-1.63915336e+00', '-2.54277343e-15', '-2.48959140e-15'], 2: ['-1.69191790e+00', '-2.62462561e-15', '-2.56973173e-15'], 3: ['-1.72920227e+00', '-2.68246377e-15', '-2.62636012e-15'], 4: ['-1.62961555e+00', '-2.52797767e-15', '-2.47510523e-15'], 5: ['-1.57713342e+00', '-2.44656340e-15', '-2.39539372e-15'], 6: ['-1.72897375e+00', '-2.68210929e-15', '-2.62601305e-15'], 7: ['-1.48862195e+00', '-2.30925809e-15', '-2.26096006e-15'], 8: ['-1.51591396e+00', '-2.35159534e-15', '-2.30241195e-15'], 9: ['-1.54135919e+00', '-2.39106792e-15', '-2.34105888e-15']}, 'L': {0: 0.9582306809661671, 1: 0.9564957485824027, 2: 0.9550059224371557, 3: 0.9615583774318917, 4: 0.9602177760259737, 5: 0.9589987519260235, 6: 0.9535800607266656, 7: 0.9571476500665267, 8: 0.9552049510914844, 9: 0.953460072490227}, 'x': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'y': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}, 'alpha': {0: -0.7215912027987663, 1: -0.719527331916007, 2: -0.7177451479100487, 3: -0.7255156166536015, 4: -0.7239399868865558, 5: -0.7225009735356016, 6: -0.7160308360594005, 7: -0.7203042790640757, 8: -0.7179837655204843, 9: -0.7158861861473951}, 'beta': {0: 0.06380696059868196, 1: 0.06587083148144124, 2: 0.06765301548739955, 3: 0.05988254674384674, 4: 0.06145817651089247, 5: 0.06289718986184667, 6: 0.06936732733804774, 7: 0.0650938843333726, 8: 0.06741439787696402, 9: 0.0695119772500532}})
# make the new columns
df['newx'] = np.nan
df['newy'] = np.nan
# if any of the values are np.nan when we're done, something went wrong
# Do the float `between` comparison but cleverly
EPSILON = 1e-6
# windows = ((pi/2 ± EPSILON), (3pi/2 ± EPSILON)); between() expects (lower, upper)
windows = tuple(tuple(d*math.pi/2 + s*EPSILON for s in (-1, 1)) for d in (1, 3))
# challenge: make this more DRY (don't repeat yourself)
alpha_cond = sum([df.alpha.between(*w) for w in windows]).astype(bool)
beta_cond = (sum([df.beta.between(*w) for w in windows]).astype(bool)
             & ~alpha_cond)
neither = (~alpha_cond & ~beta_cond)
# Handle `alpha near pi/2 or 3pi/2`:
c1 = df.loc[alpha_cond]
df.loc[alpha_cond,'newx'] = mid  # NOTE: `mid` is never defined in the question; supply it before running
# Changes vs. OP's code: `tan` parenthesis moved, `df.x - mid` flipped to
# `mid - df.x`, and the trailing `mid` changed to `df.y`:
df.loc[alpha_cond,'newy'] = (np.tan(c1.beta) * (mid - c1.x) + c1.y).astype(int)
# Handle `beta near pi/2 or 3pi/2`:
c2 = df.loc[beta_cond]
df.loc[beta_cond,'newx'] = c2.x
# Changes vs. OP's code: `tan` parenthesis moved, `mid - df.x` flipped to `df.x - mid`:
df.loc[beta_cond,'newy'] = (np.tan(c2.alpha) * (c2.x - mid) + mid).astype(int)
# Handle the remainder:
c3 = df.loc[neither]
m0 = np.tan(c3.alpha)
m1 = np.tan(c3.beta)
t = ((m0 * c3.x - m1 * mid) - (c3.y - mid)) / (m0 - m1)
df.loc[neither,'newx'] = t.astype(int)
df.loc[neither,'newy'] = (m0 * (t - c3.x) + c3.y).astype(int)
Same approach as @Joshua Voskamp's, but I still wanted to share:
import pandas as pd
import numpy as np
import math
df = pd.DataFrame({'X[m]': {0: 1.439484727008419, 1: 1.439484727008419, 2: 1.439484727008419, 3: 1.439484727008419, 4: 1.439484727008419, 5: 1.439484727008419, 6: 1.439484727008419, 7: 1.439484727008419, 8: 1.439484727008419, 9: 1.439484727008419}, 'Y[m]': {0: 0.08709958190841899, 1: 0.08972904270841897, 2: 0.091991981408419, 3: 0.08207325440841898, 4: 0.08409548540841899, 5: 0.08593746080841899, 6: 0.09416210370841899, 7: 0.08874029660841898, 8: 0.09168940400841899, 9: 0.09434491760841898}, 'Z[m]': {0: 0.029770726299999998, 1: 0.0291213803, 2: 0.030058834700000002, 3: 0.0307212565, 4: 0.028951926200000002, 5: 0.0280194897, 6: 0.030717188500000003, 7: 0.026446931099999998, 8: 0.0269318204, 9: 0.0273838975}, 'Velocity[ms^-1]': {0: ['-1.67570162e+00', '-2.59946979e-15', '-2.54510192e-15'], 1: ['-1.63915336e+00', '-2.54277343e-15', '-2.48959140e-15'], 2: ['-1.69191790e+00', '-2.62462561e-15', '-2.56973173e-15'], 3: ['-1.72920227e+00', '-2.68246377e-15', '-2.62636012e-15'], 4: ['-1.62961555e+00', '-2.52797767e-15', '-2.47510523e-15'], 5: ['-1.57713342e+00', '-2.44656340e-15', '-2.39539372e-15'], 6: ['-1.72897375e+00', '-2.68210929e-15', '-2.62601305e-15'], 7: ['-1.48862195e+00', '-2.30925809e-15', '-2.26096006e-15'], 8: ['-1.51591396e+00', '-2.35159534e-15', '-2.30241195e-15'], 9: ['-1.54135919e+00', '-2.39106792e-15', '-2.34105888e-15']}, 'L': {0: 0.9582306809661671, 1: 0.9564957485824027, 2: 0.9550059224371557, 3: 0.9615583774318917, 4: 0.9602177760259737, 5: 0.9589987519260235, 6: 0.9535800607266656, 7: 0.9571476500665267, 8: 0.9552049510914844, 9: 0.953460072490227}, 'x': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'y': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}, 'alpha': {0: -0.7215912027987663, 1: -0.719527331916007, 2: -0.7177451479100487, 3: -0.7255156166536015, 4: -0.7239399868865558, 5: -0.7225009735356016, 6: -0.7160308360594005, 7: -0.7203042790640757, 8: -0.7179837655204843, 9: -0.7158861861473951}, 'beta': {0: 0.06380696059868196, 1: 0.06587083148144124, 2: 0.06765301548739955, 3: 0.05988254674384674, 4: 0.06145817651089247, 5: 0.06289718986184667, 6: 0.06936732733804774, 7: 0.0650938843333726, 8: 0.06741439787696402, 9: 0.0695119772500532}, 'newx': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'newy': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}})
mid = 0  # not sure what the mid value should be
near_threshold = 0.001
alpha_near_half_pi = df.alpha.sub(math.pi/2).abs().le(near_threshold)
alpha_near_three_half_pi = df.alpha.sub(3*math.pi/2).abs().le(near_threshold)
beta_near_half_pi = df.beta.sub(math.pi/2).abs().le(near_threshold)
beta_near_three_half_pi = df.beta.sub(3*math.pi/2).abs().le(near_threshold)
cond1 = alpha_near_half_pi | alpha_near_three_half_pi
cond2 = beta_near_half_pi | beta_near_three_half_pi
cond2 = cond2 & (~cond1) #if cond1 is true, we don't want to do cond2
cond3 = ~(cond1 | cond2) #if neither cond1 nor cond2, then we are in cond3
#Process cond1 rows
c1 = df.loc[cond1]
df.loc[cond1,'newx'] = mid
df.loc[cond1,'newy'] = np.tan(c1.beta*(c1.x-mid)+mid)
#Process cond2 rows
c2 = df.loc[cond2]
df.loc[cond2,'newy'] = np.tan(c2.alpha*(mid-c2.x)+mid)
#Process cond3 rows
c3 = df.loc[cond3]
m0 = np.tan(c3.alpha)
m1 = np.tan(c3.beta)
# Is this a mistake? `(c3.y - c3.y)` is always 0 -- OP's original line has `(y - mid)` here
x = ((m0 * c3.x - m1 * mid) - (c3.y - c3.y)) / (m0 - m1)
df.loc[cond3,'newx'] = x.astype(int)
df.loc[cond3,'newy'] = (m0 * (x - c3.x) + c3.y).astype(int)
df
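For what it's worth, the three masked assignments can also be collapsed with np.select. A sketch reusing cond1/cond2/cond3 and mid from above, computing the slopes on the full frame (everything cast to int for uniformity):
m0_all = np.tan(df.alpha)
m1_all = np.tan(df.beta)
x_all = ((m0_all * df.x - m1_all * mid) - (df.y - mid)) / (m0_all - m1_all)
df['newx'] = np.select([cond1, cond2], [mid, df.x], default=x_all).astype(int)
df['newy'] = np.select(
    [cond1, cond2],
    [np.tan(df.beta * (df.x - mid) + mid),    # cond1 branch, as transcribed above
     np.tan(df.alpha * (mid - df.x) + mid)],  # cond2 branch, as transcribed above
    default=m0_all * (x_all - df.x) + df.y,
).astype(int)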

In Python/pandas, how do I ignore invalid values when converting columns from hex to decimal?

when I use:
df[["Type 2", "Type 4"]].applymap(lambda n: int(n, 16))
It stops with an error because the Type 2 column contains values (negative numbers, NaN, arbitrary strings, ...) that are invalid for hex conversion. How can I ignore this error, or mark the invalid values as zero?
{'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
You can write a small helper function that handles this exception and use it in place of the bare lambda. For example:
def lambda_int(n):
    try:
        return int(n, 16)
    except (ValueError, TypeError):  # TypeError covers real (float) NaN values
        return 0
df[["Type 2", "Type 4"]] = df[["Type 2", "Type 4"]].applymap(lambda_int)
Please go through this; I reconstructed your question and give steps to follow.
1. The first dictionary you provided does not have a NaN value; it has the string "NaN":
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
import pandas as pd
df = pd.DataFrame(data)
df.head()
To check for NaNs in your df and remove them:
columns_with_na = df.isna().sum()
#filter starting from 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
print(len(columns_with_na.sort_values(ascending = False))) #print them in descending order
Prints 0 and 0 because there is no NaN.
Reconstructed your data to include a real NaN by using numpy.nan:
import numpy as np
#recreated a dataset and included a nan value : np.nan at Type 2
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: np.nan,
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
df2 = pd.DataFrame(data)
df2.head()
#sum up number of columns with nan
columns_with_na = df2.isna().sum()
#filter starting from 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
print(len(columns_with_na.sort_values(ascending = False)))
Prints 1 and 1 because there is a NaN in the Type 2 column.
#drop nan values
df2 = df2.dropna(how = 'any')
#sum up number of columns with nan
columns_with_na = df2.isna().sum()
#filter starting from 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
#prints 0 because I dropped all the nan values
df2.head()
To fill NaNs in the whole df2 with 0, use:
df2.fillna(0, inplace = True)
To fill NaNs with 0 in df2['Type 2'] only:
#if you don't want to change the original dataframe, set inplace to False
df2['Type 2'].fillna(0, inplace = True) #inplace is set to True to change the original df
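Putting the two steps together (a sketch, where lambda_int is the try/except helper from the first answer): fill real NaNs first, then convert, so both the string 'NaN' and a float NaN end up as 0:
df2[["Type 2", "Type 4"]] = df2[["Type 2", "Type 4"]].fillna('0').applymap(lambda_int)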

Double header dataframe, sumif (possibly groupby?) with python

So here is an image of what I have and what I want to get: https://imgur.com/a/RyDbvZD
Basically those are SUMIF formulas in Excel, and I would like to recreate them in Python. I was trying with the pandas groupby().sum() function, but I have no clue how to group by on two header rows like this, and then how to order the data.
Original dataframe:
df = pd.DataFrame( {'Group': {0: 'Name', 1: 20201001, 2: 20201002, 3: 20201003, 4: 20201004, 5: 20201005, 6: 20201006, 7: 20201007, 8: 20201008, 9: 20201009, 10: 20201010}, 'Credit': {0: 'Credit', 1: 65, 2: 69, 3: 92, 4: 18, 5: 58, 6: 12, 7: 31, 8: 29, 9: 12, 10: 41}, 'Equity': {0: 'Stock', 1: 92, 2: 62, 3: 54, 4: 52, 5: 14, 6: 5, 7: 14, 8: 17, 9: 54, 10: 51}, 'Equity.1': {0: 'Option', 1: 87, 2: 30, 3: 40, 4: 24, 5: 95, 6: 77, 7: 44, 8: 77, 9: 88, 10: 85}, 'Credit.1': {0: 'Credit', 1: 62, 2: 60, 3: 91, 4: 57, 5: 65, 6: 50, 7: 75, 8: 55, 9: 48, 10: 99}, 'Equity.2': {0: 'Option', 1: 61, 2: 91, 3: 38, 4: 3, 5: 71, 6: 51, 7: 74, 8: 41, 9: 59, 10: 31}, 'Bond': {0: 'Bond', 1: 4, 2: 62, 3: 91, 4: 66, 5: 30, 6: 51, 7: 76, 8: 6, 9: 65, 10: 73}, 'Unnamed: 7': {0: 'Stock', 1: 54, 2: 23, 3: 74, 4: 92, 5: 36, 6: 89, 7: 88, 8: 32, 9: 19, 10: 91}, 'Bond.1': {0: 'Bond', 1: 96, 2: 10, 3: 11, 4: 7, 5: 28, 6: 82, 7: 13, 8: 46, 9: 70, 10: 46}, 'Bond.2': {0: 'Bond', 1: 25, 2: 53, 3: 96, 4: 70, 5: 52, 6: 9, 7: 98, 8: 9, 9: 48, 10: 58}, 'Unnamed: 10': {0: float('nan'), 1: 63.0, 2: 80.0, 3: 17.0, 4: 21.0, 5: 30.0, 6: 78.0, 7: 23.0, 8: 31.0, 9: 72.0, 10: 65.0}} )
What I want at the end:
df = pd.DataFrame( {'Group': {0: 20201001, 1: 20201002, 2: 20201003, 3: 20201004, 4: 20201005, 5: 20201006, 6: 20201007, 7: 20201008, 8: 20201009, 9: 20201010}, 'Credit': {0: 127, 1: 129, 2: 183, 3: 75, 4: 123, 5: 62, 6: 106, 7: 84, 8: 60, 9: 140}, 'Equity': {0: 240, 1: 183, 2: 132, 3: 79, 4: 180, 5: 133, 6: 132, 7: 135, 8: 201, 9: 167}, 'Stock': {0: 146, 1: 85, 2: 128, 3: 144, 4: 50, 5: 94, 6: 102, 7: 49, 8: 73, 9: 142}, 'Option': {0: 148, 1: 121, 2: 78, 3: 27, 4: 166, 5: 128, 6: 118, 7: 118, 8: 147, 9: 116}} )
Any ideas on where to start with this? Anything is appreciated.
Here you go. The first row seems to hold the real headers, so we first move it into the column names and set the index to Name:
df2 = df.rename(columns = df.loc[0]).drop(index = 0).set_index(['Name'])
Then we group by columns and sum:
df2.groupby(df2.columns, axis=1, sort = False).sum().reset_index()
and we get
Name Credit Stock Option Bond
0 20201001 127.0 146.0 148.0 125.0
1 20201002 129.0 85.0 121.0 125.0
2 20201003 183.0 128.0 78.0 198.0
3 20201004 75.0 144.0 27.0 143.0
4 20201005 123.0 50.0 166.0 110.0
5 20201006 62.0 94.0 128.0 142.0
6 20201007 106.0 102.0 118.0 187.0
7 20201008 84.0 49.0 118.0 61.0
8 20201009 60.0 73.0 147.0 183.0
9 20201010 140.0 142.0 116.0 177.0
I realise the output is not exactly what you asked for, but since we cannot see your SUMIF formulas, I do not know which columns you want to aggregate.
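One caveat: groupby(..., axis=1) is deprecated as of pandas 2.1, so on a recent pandas an equivalent sketch groups the transposed frame instead:
df2.T.groupby(level=0, sort=False).sum().T.reset_index()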
Edit
Following up on your comment: as far as I can tell, the rules for aggregation are somewhat messy, with the same input column included in more than one output column (like Equity.1). I do not think there is much you can do with automation here; you can replicate your SUMIF experience by directly referencing the columns you want to add. So I think the following gives you what you want:
df = df.drop(index =0)
df2 = df[['Group']].copy()
df2['Credit'] = df['Credit'] + df['Credit.1']
df2['Equity'] = df['Equity'] + df['Equity.1']+ df['Equity.2']
df2['Stock'] = df['Equity'] + df['Unnamed: 7']
df2['Option'] = df['Equity.1'] + df['Equity.2']
df2
produces
Group Credit Equity Stock Option
-- -------- -------- -------- ------- --------
1 20201001 127 240 146 148
2 20201002 129 183 85 121
3 20201003 183 132 128 78
4 20201004 75 79 144 27
5 20201005 123 180 50 166
6 20201006 62 133 94 128
7 20201007 106 132 102 118
8 20201008 84 135 49 118
9 20201009 60 201 73 147
10 20201010 140 167 142 116
This also gives you control over which columns to include in the final output.
If you want this more automated, then you need to do something about the labels of your columns: you would want a unique label for each set of columns you want to aggregate. If the same input column is used in more than one calculation, it is probably easiest to just duplicate it with the right labels, as sketched below.
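A sketch of that duplication idea (the bucket labels here are hypothetical, and only the Credit/Equity/Stock buckets are shown): copy any column that feeds two outputs, relabel every column with its output bucket, then group on the labels:
df2 = df.drop(index=0).set_index('Group')
df2['Equity (Stock)'] = df2['Equity']  # 'Equity' feeds both the Equity and Stock totals
mapping = {'Credit': 'Credit', 'Credit.1': 'Credit',
           'Equity': 'Equity', 'Equity.1': 'Equity', 'Equity.2': 'Equity',
           'Equity (Stock)': 'Stock', 'Unnamed: 7': 'Stock'}
out = df2[list(mapping)].rename(columns=mapping)
out = out.T.groupby(level=0, sort=False).sum().T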

Is there a formulaic approach to find the frequency of the sum of combinations?

I have 5 strawberries, 2 lemons, and a banana. For each possible combination of these (including selecting 0), there is a total number of objects. I ultimately want a list of the frequencies at which these sums appear.
[1 strawberry, 0 lemons, 0 bananas] = 1 objects
[2 strawberries, 0 lemons, 1 banana] = 3 objects
[0 strawberries, 1 lemon, 0 bananas] = 1 objects
[2 strawberries, 1 lemon, 0 bananas] = 3 objects
[3 strawberries, 0 lemons, 0 bananas] = 3 objects
For just the above selection of 5 combinations, "1" has a frequency of 2 and "3" has a frequency of 3.
Obviously there are far more possible combinations, each changing the frequency result. Is there a formulaic way to approach the problem to find the frequencies for an entire set of combinations?
Currently, I've set up a brute-force function in Python.
special_cards = {
'A':7, 'B':1, 'C':1, 'D':1, 'E':1, 'F':1, 'G':1, 'H':1, 'I':1, 'J':1, 'K':1, 'L':1,
'M':1, 'N':1, 'O':1, 'P':1, 'Q':1, 'R':1, 'S':1, 'T':1, 'U':1, 'V':1, 'W':1, 'X':1,
'Y':1, 'Z':1, 'AA':1, 'AB':1, 'AC':1, 'AD':1, 'AE':1, 'AF':1, 'AG':1, 'AH':1, 'AI':1, 'AJ':1,
'AK':1, 'AL':1, 'AM':1, 'AN':1, 'AO':1, 'AP':1, 'AQ':1, 'AR':1, 'AS':1, 'AT':1, 'AU':1, 'AV':1,
'AW':1, 'AX':1, 'AY':1
}
import itertools

def _calc_dis_specials(special_cards):
    """Calculate the total combinations when special cards are factored in"""
    # Create an iterator for special card combinations.
    special_paths = _gen_dis_special_list(special_cards)
    freq = {}
    path_count = 0
    for o_path in special_paths:  # Loop through the iterator
        path_count += 1  # Keep track of how many combinations we've evaluated thus far.
        try:  # I've been told I can use a collections.Counter object instead of try/except.
            path_sum = sum(o_path)  # Sum the path (counting objects)
            new_count = freq[path_sum] + 1  # Try to increment the count for our sum.
            freq.update({path_sum: new_count})
        except KeyError:
            freq.update({path_sum: 1})
            print(f"{path_count:,}\n{freq}")  # Print only when a new sum is first seen.
    print(f"{path_count:,}\n{freq}")
    # Do things with results yadda yadda

def _gen_dis_special_list(special_cards):
    """Generates an iterator for all combinations for special cards"""
    product_args = []
    for value in special_cards.values():  # A card's "value" is the maximum number that can be in a deck.
        product_args.append(range(value+1))  # Populates product_args with ranges of each card's possible count.
    result = itertools.product(*product_args)
    return result
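As the comment in the loop anticipates, collections.Counter removes the try/except bookkeeping entirely. A minimal sketch:
from collections import Counter
def _calc_dis_specials_counter(special_cards):
    """Tally the sums of all combinations, letting Counter do the bookkeeping."""
    special_paths = _gen_dis_special_list(special_cards)
    return Counter(sum(o_path) for o_path in special_paths)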
However, for large numbers of object pools (50+) the factorial just gets out of hand. Billions upon billions of combinations. I need a formulaic approach.
Looking at some output, I notice a couple of things:
1
{0: 1}
2
{0: 1, 1: 1}
4
{0: 1, 1: 2, 2: 1}
8
{0: 1, 1: 3, 2: 3, 3: 1}
16
{0: 1, 1: 4, 2: 6, 3: 4, 4: 1}
32
{0: 1, 1: 5, 2: 10, 3: 10, 4: 5, 5: 1}
64
{0: 1, 1: 6, 2: 15, 3: 20, 4: 15, 5: 6, 6: 1}
128
{0: 1, 1: 7, 2: 21, 3: 35, 4: 35, 5: 21, 6: 7, 7: 1}
256
{0: 1, 1: 8, 2: 28, 3: 56, 4: 70, 5: 56, 6: 28, 7: 8, 8: 1}
512
{0: 1, 1: 9, 2: 36, 3: 84, 4: 126, 5: 126, 6: 84, 7: 36, 8: 9, 9: 1}
1,024
{0: 1, 1: 10, 2: 45, 3: 120, 4: 210, 5: 252, 6: 210, 7: 120, 8: 45, 9: 10, 10: 1}
2,048
{0: 1, 1: 11, 2: 55, 3: 165, 4: 330, 5: 462, 6: 462, 7: 330, 8: 165, 9: 55, 10: 11, 11: 1}
4,096
{0: 1, 1: 12, 2: 66, 3: 220, 4: 495, 5: 792, 6: 924, 7: 792, 8: 495, 9: 220, 10: 66, 11: 12, 12: 1}
8,192
{0: 1, 1: 13, 2: 78, 3: 286, 4: 715, 5: 1287, 6: 1716, 7: 1716, 8: 1287, 9: 715, 10: 286, 11: 78, 12: 13, 13: 1}
16,384
{0: 1, 1: 14, 2: 91, 3: 364, 4: 1001, 5: 2002, 6: 3003, 7: 3432, 8: 3003, 9: 2002, 10: 1001, 11: 364, 12: 91, 13: 14, 14: 1}
32,768
{0: 1, 1: 15, 2: 105, 3: 455, 4: 1365, 5: 3003, 6: 5005, 7: 6435, 8: 6435, 9: 5005, 10: 3003, 11: 1365, 12: 455, 13: 105, 14: 15, 15: 1}
65,536
{0: 1, 1: 16, 2: 120, 3: 560, 4: 1820, 5: 4368, 6: 8008, 7: 11440, 8: 12870, 9: 11440, 10: 8008, 11: 4368, 12: 1820, 13: 560, 14: 120, 15: 16, 16: 1}
131,072
{0: 1, 1: 17, 2: 136, 3: 680, 4: 2380, 5: 6188, 6: 12376, 7: 19448, 8: 24310, 9: 24310, 10: 19448, 11: 12376, 12: 6188, 13: 2380, 14: 680, 15: 136, 16: 17, 17: 1}
262,144
{0: 1, 1: 18, 2: 153, 3: 816, 4: 3060, 5: 8568, 6: 18564, 7: 31824, 8: 43758, 9: 48620, 10: 43758, 11: 31824, 12: 18564, 13: 8568, 14: 3060, 15: 816, 16: 153, 17: 18, 18: 1}
524,288
{0: 1, 1: 19, 2: 171, 3: 969, 4: 3876, 5: 11628, 6: 27132, 7: 50388, 8: 75582, 9: 92378, 10: 92378, 11: 75582, 12: 50388, 13: 27132, 14: 11628, 15: 3876, 16: 969, 17: 171, 18: 19, 19: 1}
1,048,576
{0: 1, 1: 20, 2: 190, 3: 1140, 4: 4845, 5: 15504, 6: 38760, 7: 77520, 8: 125970, 9: 167960, 10: 184756, 11: 167960, 12: 125970, 13: 77520, 14: 38760, 15: 15504, 16: 4845, 17: 1140, 18: 190, 19: 20, 20: 1}
2,097,152
{0: 1, 1: 21, 2: 210, 3: 1330, 4: 5985, 5: 20349, 6: 54264, 7: 116280, 8: 203490, 9: 293930, 10: 352716, 11: 352716, 12: 293930, 13: 203490, 14: 116280, 15: 54264, 16: 20349, 17: 5985, 18: 1330, 19: 210, 20: 21, 21: 1}
4,194,304
{0: 1, 1: 22, 2: 231, 3: 1540, 4: 7315, 5: 26334, 6: 74613, 7: 170544, 8: 319770, 9: 497420, 10: 646646, 11: 705432, 12: 646646, 13: 497420, 14: 319770, 15: 170544, 16: 74613, 17: 26334, 18: 7315, 19: 1540, 20: 231, 21: 22, 22: 1}
8,388,608
{0: 1, 1: 23, 2: 253, 3: 1771, 4: 8855, 5: 33649, 6: 100947, 7: 245157, 8: 490314, 9: 817190, 10: 1144066, 11: 1352078, 12: 1352078, 13: 1144066, 14: 817190, 15: 490314, 16: 245157, 17: 100947, 18: 33649, 19: 8855, 20: 1771, 21: 253, 22: 23, 23: 1}
16,777,216
{0: 1, 1: 24, 2: 276, 3: 2024, 4: 10626, 5: 42504, 6: 134596, 7: 346104, 8: 735471, 9: 1307504, 10: 1961256, 11: 2496144, 12: 2704156, 13: 2496144, 14: 1961256, 15: 1307504, 16: 735471, 17: 346104, 18: 134596, 19: 42504, 20: 10626, 21: 2024, 22: 276, 23: 24, 24: 1}
33,554,432
{0: 1, 1: 25, 2: 300, 3: 2300, 4: 12650, 5: 53130, 6: 177100, 7: 480700, 8: 1081575, 9: 2042975, 10: 3268760, 11: 4457400, 12: 5200300, 13: 5200300, 14: 4457400, 15: 3268760, 16: 2042975, 17: 1081575, 18: 480700, 19: 177100, 20: 53130, 21: 12650, 22: 2300, 23: 300, 24: 25, 25: 1}
67,108,864
{0: 1, 1: 26, 2: 325, 3: 2600, 4: 14950, 5: 65780, 6: 230230, 7: 657800, 8: 1562275, 9: 3124550, 10: 5311735, 11: 7726160, 12: 9657700, 13: 10400600, 14: 9657700, 15: 7726160, 16: 5311735, 17: 3124550, 18: 1562275, 19: 657800, 20: 230230, 21: 65780, 22: 14950, 23: 2600, 24: 325, 25: 26, 26: 1}
134,217,728
{0: 1, 1: 27, 2: 351, 3: 2925, 4: 17550, 5: 80730, 6: 296010, 7: 888030, 8: 2220075, 9: 4686825, 10: 8436285, 11: 13037895, 12: 17383860, 13: 20058300, 14: 20058300, 15: 17383860, 16: 13037895, 17: 8436285, 18: 4686825, 19: 2220075, 20: 888030, 21: 296010, 22: 80730, 23: 17550, 24: 2925, 25: 351, 26: 27, 27: 1}
268,435,456
{0: 1, 1: 28, 2: 378, 3: 3276, 4: 20475, 5: 98280, 6: 376740, 7: 1184040, 8: 3108105, 9: 6906900, 10: 13123110, 11: 21474180, 12: 30421755, 13: 37442160, 14: 40116600, 15: 37442160, 16: 30421755, 17: 21474180, 18: 13123110, 19: 6906900, 20: 3108105, 21: 1184040, 22: 376740, 23: 98280, 24: 20475, 25: 3276, 26: 378, 27: 28, 28: 1}
536,870,912
{0: 1, 1: 29, 2: 406, 3: 3654, 4: 23751, 5: 118755, 6: 475020, 7: 1560780, 8: 4292145, 9: 10015005, 10: 20030010, 11: 34597290, 12: 51895935, 13: 67863915, 14: 77558760, 15: 77558760, 16: 67863915, 17: 51895935, 18: 34597290, 19: 20030010, 20: 10015005, 21: 4292145, 22: 1560780, 23: 475020, 24: 118755, 25: 23751, 26: 3654, 27: 406, 28: 29, 29: 1}
1,073,741,824
{0: 1, 1: 30, 2: 435, 3: 4060, 4: 27405, 5: 142506, 6: 593775, 7: 2035800, 8: 5852925, 9: 14307150, 10: 30045015, 11: 54627300, 12: 86493225, 13: 119759850, 14: 145422675, 15: 155117520, 16: 145422675, 17: 119759850, 18: 86493225, 19: 54627300, 20: 30045015, 21: 14307150, 22: 5852925, 23: 2035800, 24: 593775, 25: 142506, 26: 27405, 27: 4060, 28: 435, 29: 30, 30: 1}
Note that I'm only printing when a new key (sum) is found.
I notice that:
- a new sum is found only on powers of 2, and
- the results are symmetrical.
This hints to me that there's a formulaic approach that could work.
Any ideas on how to proceed?
Good news: there is a formula for this, and I'll explain the path to it in case there is any confusion.
Let's look at your initial example: 5 strawberries (S), 2 lemons (L), and a banana (B). Let's lay out all of the fruits:
S S S S S L L B
We can actually rephrase the question now, because the number of times that 3, for example, will be the total number is the number of different ways you can pick 3 of the fruits from this list.
In statistics, the choose function (a.k.a. nCk) answers just this question: how many ways are there to select a group of k items from a group of n items? It is computed as n!/((n-k)!*k!), where "!" is the factorial: the product of a number and every positive integer below it. As such, the frequency of 3s would be (the number of fruits) "choose" (the total in question), or 8 choose 3. This is 8!/(5!*3!) = 56.
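In Python (3.8+), math.comb computes nCk directly, so the claim is easy to verify:
import math
print(math.comb(8, 3))  # 56, i.e. 8!/((8-3)!*3!)
# frequency of every possible total for 8 unit items -- a row of Pascal's triangle
freqs = {k: math.comb(8, k) for k in range(8 + 1)}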

How to do group by on a multiindex in pandas?

Below is my dataframe. I made some transformations to create the category column and dropped the original column it was derived from. Now I need to do a group by to remove the dups; e.g. Love and Fashion can be rolled up via a groupby sum.
df.columns = array([category, clicks, revenue, date, impressions, size], dtype=object)
df.values=
[[Love 0 0.36823 2013-11-04 380 300x250]
[Love 183 474.81522 2013-11-04 374242 300x250]
[Fashion 0 0.19434 2013-11-04 197 300x250]
[Fashion 9 18.26422 2013-11-04 13363 300x250]]
Here is the index that is created when I created the dataframe
print df.index
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48])
I assume I want to drop the index, create date and category as a multi-index, and then do a groupby sum of the metrics. How do I do this with a pandas dataframe?
df.head(15).to_dict()= {'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'}, 'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229}, 'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}}
Python is 2.7 and pandas is 0.7.0 on Ubuntu 12.04. Below is the error I get if I run the following:
import pandas
print pandas.__version__
df = pandas.DataFrame.from_dict(
{
'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'},
'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229},
'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}
}
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0,1]).sum()
Traceback (most recent call last):
File "/home/ubuntu/workspace/devops/reports/groupby_sub.py", line 9, in <module>
df.set_index(['date', 'category'], inplace=True)
File "/usr/lib/pymodules/python2.7/pandas/core/frame.py", line 1927, in set_index
raise Exception('Index has duplicate keys: %s' % duplicates)
Exception: Index has duplicate keys: [('2013-11-04', 'Celebs'), ('2013-11-04', 'Fashion'), ('2013-11-04', 'Health'), ('2013-11-04', 'Love'), ('2013-11-04', 'Movies')]
You can create the index on the existing dataframe. With the subset of data provided, this works for me:
import pandas
df = pandas.DataFrame.from_dict(
{
'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'},
'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229},
'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}
}
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0,1]).sum()
If you're having duplicate index issues with the full dataset, you'll need to clean up the data a bit. Remove the duplicate rows if that's amenable. If the duplicate rows are valid, then what sets them apart from each other? If you can add that to the dataframe and include it in the index, that's ideal. If not, just create a dummy column that defaults to 1 but can be 2 or 3 or ... N in the case of N duplicates, and then include that field in the index as well (a sketch of this follows).
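In a modern pandas, that dummy column can be built directly with cumcount. A sketch, assuming date and category are still ordinary columns:
df['dup_n'] = df.groupby(['date', 'category']).cumcount() + 1  # 1, 2, ..., N within each pair
df.set_index(['date', 'category', 'dup_n'], inplace=True)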
Alternatively, I'm pretty sure you can skip the index creation and directly groupby with columns:
df.groupby(by=['date', 'category']).sum()
Again, that works on the subset of data that you posted.
I usually run into this when I try to unstack a multi-index and it fails because there are duplicate values.
Here is the simple command that I run to find the problematic items:
df.groupby(level=df.index.names).count()
