Related
pandas.merge(df1, df2, on='col_4') performs an inner join by default, which keeps every pair of rows whose values in the shared column col_4 are equal.
Question: Let us say we have 4 rows in df1 and 3 rows in df2. If all values in the shared column are the same, then every row of df1 is paired with every row of df2, so each row of df1 appears 3 times and we end up with 12 rows in total.
Problem: Is there a way to stop once we find the first match between a row of df1 and a row of df2, and then move on to the next row of df1? In addition, the same match cannot be reused: if row 1 of df1 is matched to row 1 of df2 based on the same value in the shared column col_4, then row 2 of df1 must be matched with row 2 of df2.
Code:
import pandas as pd
df1 = pd.DataFrame(
{
'ID':[1,2,3,5,9],
'col_1': [1,2,3,4,5],
'col_2':[6,7,8,9,10],
'col_3':[11,12,13,14,15],
'col_4':['apple', 'apple', 'apple', 'apple', 'apple']
}
)
df2 = pd.DataFrame(
{
'ID':[1,1,3,5],
'col_1': [8,9,10,11],
'col_2':[12,13,15,17],
'col_3':[12,13,14,15],
'col_4':['apple', 'apple', 'apple', 'apple']
}
)
pd.merge(df1, df2, on='col_4')
So how can the merge stop at the first available match from df1 to df2 based on the shared column col_4, instead of producing the full cross product? The current merge result, in dictionary format, is:
{'ID_x': {0: 1,
1: 1,
2: 1,
3: 1,
4: 2,
5: 2,
6: 2,
7: 2,
8: 3,
9: 3,
10: 3,
11: 3,
12: 5,
13: 5,
14: 5,
15: 5,
16: 9,
17: 9,
18: 9,
19: 9},
'col_1_x': {0: 1,
1: 1,
2: 1,
3: 1,
4: 2,
5: 2,
6: 2,
7: 2,
8: 3,
9: 3,
10: 3,
11: 3,
12: 4,
13: 4,
14: 4,
15: 4,
16: 5,
17: 5,
18: 5,
19: 5},
'col_2_x': {0: 6,
1: 6,
2: 6,
3: 6,
4: 7,
5: 7,
6: 7,
7: 7,
8: 8,
9: 8,
10: 8,
11: 8,
12: 9,
13: 9,
14: 9,
15: 9,
16: 10,
17: 10,
18: 10,
19: 10},
'col_3_x': {0: 11,
1: 11,
2: 11,
3: 11,
4: 12,
5: 12,
6: 12,
7: 12,
8: 13,
9: 13,
10: 13,
11: 13,
12: 14,
13: 14,
14: 14,
15: 14,
16: 15,
17: 15,
18: 15,
19: 15},
'col_4': {0: 'apple',
1: 'apple',
2: 'apple',
3: 'apple',
4: 'apple',
5: 'apple',
6: 'apple',
7: 'apple',
8: 'apple',
9: 'apple',
10: 'apple',
11: 'apple',
12: 'apple',
13: 'apple',
14: 'apple',
15: 'apple',
16: 'apple',
17: 'apple',
18: 'apple',
19: 'apple'},
'ID_y': {0: 1,
1: 1,
2: 3,
3: 5,
4: 1,
5: 1,
6: 3,
7: 5,
8: 1,
9: 1,
10: 3,
11: 5,
12: 1,
13: 1,
14: 3,
15: 5,
16: 1,
17: 1,
18: 3,
19: 5},
'col_1_y': {0: 8,
1: 9,
2: 10,
3: 11,
4: 8,
5: 9,
6: 10,
7: 11,
8: 8,
9: 9,
10: 10,
11: 11,
12: 8,
13: 9,
14: 10,
15: 11,
16: 8,
17: 9,
18: 10,
19: 11},
'col_2_y': {0: 12,
1: 13,
2: 15,
3: 17,
4: 12,
5: 13,
6: 15,
7: 17,
8: 12,
9: 13,
10: 15,
11: 17,
12: 12,
13: 13,
14: 15,
15: 17,
16: 12,
17: 13,
18: 15,
19: 17},
'col_3_y': {0: 12,
1: 13,
2: 14,
3: 15,
4: 12,
5: 13,
6: 14,
7: 15,
8: 12,
9: 13,
10: 14,
11: 15,
12: 12,
13: 13,
14: 14,
15: 15,
16: 12,
17: 13,
18: 14,
19: 15}}
Edit: if the first row of df1 is matched with the first row of df2, then the second row of df1 cannot be matched with the first row of df2 again; it should be matched with the second row of df2 if there is a match.
You can add a serial number column serial for each group of rows sharing the same value of col_4 in each of df1 and df2, then merge on both col_4 and serial, as follows:
We generate the serial number by .groupby() + cumcount():
df1['serial'] = df1.groupby('col_4').cumcount()
df2['serial'] = df2.groupby('col_4').cumcount()
df1.merge(df2, on=['col_4', 'serial'])
Result:
ID_x col_1_x col_2_x col_3_x col_4 serial ID_y col_1_y col_2_y col_3_y
0 1 1 6 11 apple 0 1 8 12 12
1 2 2 7 12 apple 1 1 9 13 13
2 3 3 8 13 apple 2 3 10 15 14
3 5 4 9 14 apple 3 5 11 17 15
Optionally, you can then drop the serial column, as follows:
df1.merge(df2, on=['col_4', 'serial']).drop('serial', axis=1)
Result:
ID_x col_1_x col_2_x col_3_x col_4 ID_y col_1_y col_2_y col_3_y
0 1 1 6 11 apple 1 8 12 12
1 2 2 7 12 apple 1 9 13 13
2 3 3 8 13 apple 3 10 15 14
3 5 4 9 14 apple 5 11 17 15
Edit
You can also simplify the code by generating the serial numbers directly inside the .merge() call, as follows (thanks to @HenryEcker for the suggestion):
df1.merge(df2,
          left_on=['col_4', df1.groupby('col_4').cumcount()],
          right_on=['col_4', df2.groupby('col_4').cumcount()]
          ).drop('key_1', axis=1)
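If you would rather not modify df1 and df2 at all, the same idea can be wrapped in a small helper. This is only an illustrative sketch (the name match_first is not from the original answer); like the snippet above, it relies on pandas exposing the unnamed cumcount key as 'key_1' in the merge result:

def match_first(left, right, key):
    """Pair the i-th occurrence of each key value in left with the i-th occurrence in right."""
    return left.merge(
        right,
        left_on=[key, left.groupby(key).cumcount()],
        right_on=[key, right.groupby(key).cumcount()],
    ).drop('key_1', axis=1)

match_first(df1, df2, 'col_4')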
I have 5 strawberries, 2 lemons, and a banana. For each possible combination of these (including selecting 0), there is a total number of objects. I ultimately want a list of the frequencies at which these sums appear.
[1 strawberry, 0 lemons, 0 bananas] = 1 objects
[2 strawberries, 0 lemons, 1 banana] = 3 objects
[0 strawberries, 1 lemon, 0 bananas] = 1 objects
[2 strawberries, 1 lemon, 0 bananas] = 3 objects
[3 strawberries, 0 lemons, 0 bananas] = 3 objects
For just the above selection of 5 combinations, "1" has a frequency of 2 and "3" has a frequency of 3.
Obviously there are far more possible combinations, each changing the frequency result. Is there a formulaic way to approach the problem to find the frequencies for an entire set of combinations?
Currently, I've set up a brute-force function in Python.
import itertools

special_cards = {
'A':7, 'B':1, 'C':1, 'D':1, 'E':1, 'F':1, 'G':1, 'H':1, 'I':1, 'J':1, 'K':1, 'L':1,
'M':1, 'N':1, 'O':1, 'P':1, 'Q':1, 'R':1, 'S':1, 'T':1, 'U':1, 'V':1, 'W':1, 'X':1,
'Y':1, 'Z':1, 'AA':1, 'AB':1, 'AC':1, 'AD':1, 'AE':1, 'AF':1, 'AG':1, 'AH':1, 'AI':1, 'AJ':1,
'AK':1, 'AL':1, 'AM':1, 'AN':1, 'AO':1, 'AP':1, 'AQ':1, 'AR':1, 'AS':1, 'AT':1, 'AU':1, 'AV':1,
'AW':1, 'AX':1, 'AY':1
}
def _calc_dis_specials(special_cards):
    """Calculate the total combinations when special cards are factored in"""
    # Create an iterator for special card combinations.
    special_paths = _gen_dis_special_list(special_cards)
    freq = {}
    path_count = 0
    for o_path in special_paths:  # Loop through the iterator
        path_count += 1  # Keep track of how many combinations we've evaluated thus far.
        try:  # I've been told I can use a collections.Counter() object instead of try/except.
            path_sum = sum(o_path)  # Sum the path (counting objects)
            new_count = freq[path_sum] + 1  # Try to increment the count for our sum.
            freq.update({path_sum: new_count})
        except KeyError:
            freq.update({path_sum: 1})
            print(f"{path_count:,}\n{freq}")
    print(f"{path_count:,}\n{freq}")
    # Do things with results yadda yadda

def _gen_dis_special_list(special_cards):
    """Generates an iterator for all combinations for special cards"""
    product_args = []
    for value in special_cards.values():  # A card's "value" is the maximum number that can be in a deck.
        product_args.append(range(value + 1))  # Populates product_args with lists of each card's possible count.
    result = itertools.product(*product_args)
    return result
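As an aside, the collections.Counter alternative mentioned in the comment above could look roughly like this (the function name is illustrative, not part of the original code):

from collections import Counter

def _calc_dis_specials_counter(special_cards):
    """Tally the sum of every combination with a Counter instead of try/except."""
    return dict(Counter(sum(path) for path in _gen_dis_special_list(special_cards)))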
However, for large numbers of object pools (50+) the factorial just gets out of hand. Billions upon billions of combinations. I need a formulaic approach.
Looking at some output, I notice a couple of things:
1
{0: 1}
2
{0: 1, 1: 1}
4
{0: 1, 1: 2, 2: 1}
8
{0: 1, 1: 3, 2: 3, 3: 1}
16
{0: 1, 1: 4, 2: 6, 3: 4, 4: 1}
32
{0: 1, 1: 5, 2: 10, 3: 10, 4: 5, 5: 1}
64
{0: 1, 1: 6, 2: 15, 3: 20, 4: 15, 5: 6, 6: 1}
128
{0: 1, 1: 7, 2: 21, 3: 35, 4: 35, 5: 21, 6: 7, 7: 1}
256
{0: 1, 1: 8, 2: 28, 3: 56, 4: 70, 5: 56, 6: 28, 7: 8, 8: 1}
512
{0: 1, 1: 9, 2: 36, 3: 84, 4: 126, 5: 126, 6: 84, 7: 36, 8: 9, 9: 1}
1,024
{0: 1, 1: 10, 2: 45, 3: 120, 4: 210, 5: 252, 6: 210, 7: 120, 8: 45, 9: 10, 10: 1}
2,048
{0: 1, 1: 11, 2: 55, 3: 165, 4: 330, 5: 462, 6: 462, 7: 330, 8: 165, 9: 55, 10: 11, 11: 1}
4,096
{0: 1, 1: 12, 2: 66, 3: 220, 4: 495, 5: 792, 6: 924, 7: 792, 8: 495, 9: 220, 10: 66, 11: 12, 12: 1}
8,192
{0: 1, 1: 13, 2: 78, 3: 286, 4: 715, 5: 1287, 6: 1716, 7: 1716, 8: 1287, 9: 715, 10: 286, 11: 78, 12: 13, 13: 1}
16,384
{0: 1, 1: 14, 2: 91, 3: 364, 4: 1001, 5: 2002, 6: 3003, 7: 3432, 8: 3003, 9: 2002, 10: 1001, 11: 364, 12: 91, 13: 14, 14: 1}
32,768
{0: 1, 1: 15, 2: 105, 3: 455, 4: 1365, 5: 3003, 6: 5005, 7: 6435, 8: 6435, 9: 5005, 10: 3003, 11: 1365, 12: 455, 13: 105, 14: 15, 15: 1}
65,536
{0: 1, 1: 16, 2: 120, 3: 560, 4: 1820, 5: 4368, 6: 8008, 7: 11440, 8: 12870, 9: 11440, 10: 8008, 11: 4368, 12: 1820, 13: 560, 14: 120, 15: 16, 16: 1}
131,072
{0: 1, 1: 17, 2: 136, 3: 680, 4: 2380, 5: 6188, 6: 12376, 7: 19448, 8: 24310, 9: 24310, 10: 19448, 11: 12376, 12: 6188, 13: 2380, 14: 680, 15: 136, 16: 17, 17: 1}
262,144
{0: 1, 1: 18, 2: 153, 3: 816, 4: 3060, 5: 8568, 6: 18564, 7: 31824, 8: 43758, 9: 48620, 10: 43758, 11: 31824, 12: 18564, 13: 8568, 14: 3060, 15: 816, 16: 153, 17: 18, 18: 1}
524,288
{0: 1, 1: 19, 2: 171, 3: 969, 4: 3876, 5: 11628, 6: 27132, 7: 50388, 8: 75582, 9: 92378, 10: 92378, 11: 75582, 12: 50388, 13: 27132, 14: 11628, 15: 3876, 16: 969, 17: 171, 18: 19, 19: 1}
1,048,576
{0: 1, 1: 20, 2: 190, 3: 1140, 4: 4845, 5: 15504, 6: 38760, 7: 77520, 8: 125970, 9: 167960, 10: 184756, 11: 167960, 12: 125970, 13: 77520, 14: 38760, 15: 15504, 16: 4845, 17: 1140, 18: 190, 19: 20, 20: 1}
2,097,152
{0: 1, 1: 21, 2: 210, 3: 1330, 4: 5985, 5: 20349, 6: 54264, 7: 116280, 8: 203490, 9: 293930, 10: 352716, 11: 352716, 12: 293930, 13: 203490, 14: 116280, 15: 54264, 16: 20349, 17: 5985, 18: 1330, 19: 210, 20: 21, 21: 1}
4,194,304
{0: 1, 1: 22, 2: 231, 3: 1540, 4: 7315, 5: 26334, 6: 74613, 7: 170544, 8: 319770, 9: 497420, 10: 646646, 11: 705432, 12: 646646, 13: 497420, 14: 319770, 15: 170544, 16: 74613, 17: 26334, 18: 7315, 19: 1540, 20: 231, 21: 22, 22: 1}
8,388,608
{0: 1, 1: 23, 2: 253, 3: 1771, 4: 8855, 5: 33649, 6: 100947, 7: 245157, 8: 490314, 9: 817190, 10: 1144066, 11: 1352078, 12: 1352078, 13: 1144066, 14: 817190, 15: 490314, 16: 245157, 17: 100947, 18: 33649, 19: 8855, 20: 1771, 21: 253, 22: 23, 23: 1}
16,777,216
{0: 1, 1: 24, 2: 276, 3: 2024, 4: 10626, 5: 42504, 6: 134596, 7: 346104, 8: 735471, 9: 1307504, 10: 1961256, 11: 2496144, 12: 2704156, 13: 2496144, 14: 1961256, 15: 1307504, 16: 735471, 17: 346104, 18: 134596, 19: 42504, 20: 10626, 21: 2024, 22: 276, 23: 24, 24: 1}
33,554,432
{0: 1, 1: 25, 2: 300, 3: 2300, 4: 12650, 5: 53130, 6: 177100, 7: 480700, 8: 1081575, 9: 2042975, 10: 3268760, 11: 4457400, 12: 5200300, 13: 5200300, 14: 4457400, 15: 3268760, 16: 2042975, 17: 1081575, 18: 480700, 19: 177100, 20: 53130, 21: 12650, 22: 2300, 23: 300, 24: 25, 25: 1}
67,108,864
{0: 1, 1: 26, 2: 325, 3: 2600, 4: 14950, 5: 65780, 6: 230230, 7: 657800, 8: 1562275, 9: 3124550, 10: 5311735, 11: 7726160, 12: 9657700, 13: 10400600, 14: 9657700, 15: 7726160, 16: 5311735, 17: 3124550, 18: 1562275, 19: 657800, 20: 230230, 21: 65780, 22: 14950, 23: 2600, 24: 325, 25: 26, 26: 1}
134,217,728
{0: 1, 1: 27, 2: 351, 3: 2925, 4: 17550, 5: 80730, 6: 296010, 7: 888030, 8: 2220075, 9: 4686825, 10: 8436285, 11: 13037895, 12: 17383860, 13: 20058300, 14: 20058300, 15: 17383860, 16: 13037895, 17: 8436285, 18: 4686825, 19: 2220075, 20: 888030, 21: 296010, 22: 80730, 23: 17550, 24: 2925, 25: 351, 26: 27, 27: 1}
268,435,456
{0: 1, 1: 28, 2: 378, 3: 3276, 4: 20475, 5: 98280, 6: 376740, 7: 1184040, 8: 3108105, 9: 6906900, 10: 13123110, 11: 21474180, 12: 30421755, 13: 37442160, 14: 40116600, 15: 37442160, 16: 30421755, 17: 21474180, 18: 13123110, 19: 6906900, 20: 3108105, 21: 1184040, 22: 376740, 23: 98280, 24: 20475, 25: 3276, 26: 378, 27: 28, 28: 1}
536,870,912
{0: 1, 1: 29, 2: 406, 3: 3654, 4: 23751, 5: 118755, 6: 475020, 7: 1560780, 8: 4292145, 9: 10015005, 10: 20030010, 11: 34597290, 12: 51895935, 13: 67863915, 14: 77558760, 15: 77558760, 16: 67863915, 17: 51895935, 18: 34597290, 19: 20030010, 20: 10015005, 21: 4292145, 22: 1560780, 23: 475020, 24: 118755, 25: 23751, 26: 3654, 27: 406, 28: 29, 29: 1}
1,073,741,824
{0: 1, 1: 30, 2: 435, 3: 4060, 4: 27405, 5: 142506, 6: 593775, 7: 2035800, 8: 5852925, 9: 14307150, 10: 30045015, 11: 54627300, 12: 86493225, 13: 119759850, 14: 145422675, 15: 155117520, 16: 145422675, 17: 119759850, 18: 86493225, 19: 54627300, 20: 30045015, 21: 14307150, 22: 5852925, 23: 2035800, 24: 593775, 25: 142506, 26: 27405, 27: 4060, 28: 435, 29: 30, 30: 1}
Note that I'm only printing when a new key (sum) is found.
I notice that:
- a new sum is found only on powers of 2, and
- the results are symmetrical.
This hints to me that there's a formulaic approach that could work.
Any ideas on how to proceed?
Good news: there is a formula for this, and I'll explain the path to it in case there is any confusion.
Let's look at your initial example: 5 strawberries (S), 2 lemons (L), and a banana (B). Let's lay out all of the fruits:
S S S S S L L B
We can now rephrase the question: the number of times that 3, for example, appears as the total is the number of different ways you can pick 3 of the fruits from this list.
In statistics, the choose function (a.k.a. nCk) answers exactly this question: how many ways are there to select a group of k items from a group of n items? It is computed as n!/((n-k)!*k!), where "!" is the factorial, the product of a number and every positive integer below it. So the frequency of 3s is (the number of fruits) "choose" (the total in question), i.e. 8 choose 3, which is 8!/(5!*3!) = 56.
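Applying the formula across every possible total gives the full frequency list at once. A minimal sketch (assuming Python 3.8+ for math.comb), laying out the n individual objects as in the fruit list above:

from math import comb

def sum_frequencies(n):
    # Frequency of each possible total 0..n when choosing any subset of the n laid-out objects.
    return {k: comb(n, k) for k in range(n + 1)}

print(sum_frequencies(8))  # {0: 1, 1: 8, 2: 28, 3: 56, 4: 70, 5: 56, 6: 28, 7: 8, 8: 1}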
I have the following df, weekly spend in a number of shops:
shop1 shop2 shop3 shop4 shop5 shop6 shop7 \
date_week
2 4328.85 5058.17 3028.68 2513.28 4204.10 1898.26 2209.75
3 5472.00 5085.59 3874.51 1951.60 2984.71 1416.40 1199.42
4 4665.53 4264.05 2781.70 2958.25 4593.46 2365.88 2079.73
5 5769.36 3460.79 3072.47 1866.19 3803.12 2166.84 1716.71
6 6267.00 4033.58 4053.70 2215.04 3991.31 2382.02 1974.92
7 5436.83 4402.83 3225.98 1761.87 4202.22 2430.71 3091.33
8 4850.43 4900.68 3176.00 3280.95 3483.53 4115.09 2594.01
9 6782.88 3800.03 3865.65 2221.43 4116.28 2638.28 2321.55
10 6248.18 4096.60 5186.52 3224.96 3614.24 2541.00 2708.36
11 4505.18 2889.33 2937.74 2418.34 5565.57 1570.55 1371.54
12 3115.26 1216.82 1759.49 2559.81 1403.61 1550.77 478.34
13 4561.82 827.16 4661.51 3197.90 1515.63 1688.57 247.25
shop8 shop9
date_week
2 3578.81 3134.39
3 4625.10 2676.20
4 3417.16 3870.00
5 3980.78 3439.60
6 3899.42 4192.41
7 4190.60 3989.00
8 4786.40 3484.51
9 6433.02 3474.66
10 4414.19 3809.20
11 3590.10 3414.50
12 4297.57 2094.00
13 3963.27 871.25
If I plot these in a line plot or "spaghetti plot", it works fine.
The goal is to look at the trend in weekly sales over the last three months in 9 stores.
But it looks a bit messy:
newgraph.plot()
I had a look at similar questions, such as this one which uses df.interpolate(), but it looks like I need to have missing values in there first. This answer seems to require a time series.
Is there another method to smoothen out the lines?
It doesn't matter if the values are not exactly accurate anymore, some interpolation is fine. All I am interested in is the trend over the last number of weeks. I have also tried logy=True in the plot() method to calm the lines a bit, but it didn't help.
My df, for pd.DataFrame.from_dict():
{'shop1': {2: 4328.849999999999,
3: 5472.0,
4: 4665.530000000001,
5: 5769.36,
6: 6267.0,
7: 5436.83,
8: 4850.43,
9: 6782.879999999999,
10: 6248.18,
11: 4505.18,
12: 3115.26,
13: 4561.82},
'shop2': {2: 5058.169999999993,
3: 5085.589999999996,
4: 4264.049999999997,
5: 3460.7899999999977,
6: 4033.579999999998,
7: 4402.829999999999,
8: 4900.679999999997,
9: 3800.0299999999997,
10: 4096.5999999999985,
11: 2889.3300000000004,
12: 1216.8200000000002,
13: 827.16},
'shop3': {2: 3028.679999999997,
3: 3874.5099999999984,
4: 2781.6999999999994,
5: 3072.4699999999984,
6: 4053.6999999999966,
7: 3225.9799999999987,
8: 3175.9999999999973,
9: 3865.6499999999974,
10: 5186.519999999996,
11: 2937.74,
12: 1759.49,
13: 4661.509999999998},
'shop4': {2: 2513.2799999999997,
3: 1951.6000000000001,
4: 2958.25,
5: 1866.1900000000003,
6: 2215.04,
7: 1761.8700000000001,
8: 3280.9499999999994,
9: 2221.43,
10: 3224.9600000000005,
11: 2418.3399999999997,
12: 2559.8099999999995,
13: 3197.9},
'shop5': {2: 4204.0999999999985,
3: 2984.71,
4: 4593.459999999999,
5: 3803.12,
6: 3991.31,
7: 4202.219999999999,
8: 3483.529999999999,
9: 4116.279999999999,
10: 3614.24,
11: 5565.569999999997,
12: 1403.6100000000001,
13: 1515.63},
'shop6': {2: 1898.260000000001,
3: 1416.4000000000005,
4: 2365.8799999999997,
5: 2166.84,
6: 2382.019999999999,
7: 2430.71,
8: 4115.0899999999965,
9: 2638.2800000000007,
10: 2541.0,
11: 1570.5500000000004,
12: 1550.7700000000002,
13: 1688.5700000000004},
'shop7': {2: 2209.75,
3: 1199.42,
4: 2079.7300000000005,
5: 1716.7100000000005,
6: 1974.9200000000005,
7: 3091.329999999999,
8: 2594.0099999999993,
9: 2321.5499999999997,
10: 2708.3599999999983,
11: 1371.5400000000004,
12: 478.34,
13: 247.25000000000003},
'shop8': {2: 3578.8100000000004,
3: 4625.1,
4: 3417.1599999999994,
5: 3980.7799999999997,
6: 3899.4200000000005,
7: 4190.600000000001,
8: 4786.4,
9: 6433.019999999998,
10: 4414.1900000000005,
11: 3590.1,
12: 4297.57,
13: 3963.27},
'shop9': {2: 3134.3900000000003,
3: 2676.2,
4: 3870.0,
5: 3439.6,
6: 4192.41,
7: 3989.0,
8: 3484.51,
9: 3474.66,
10: 3809.2,
11: 3414.5,
12: 2094.0,
13: 871.25}}
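To reproduce the plots below, the frame can be rebuilt from that dictionary; a minimal sketch, assuming the dictionary above has been assigned to a variable named data:

import pandas as pd

df = pd.DataFrame.from_dict(data)  # data is the dictionary shown above
df.index.name = 'date_week'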
You could show the trend by plotting a regression line for the last few weeks, perhaps separately from the actual data, as the plot is already so crowded. I would use seaborn, because it has the convenient regplot() function:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
df.plot(figsize=[12, 10], style='--')
plt.xlim(2, 18)
last4 = df[len(df)-4:]
plt.gca().set_prop_cycle(None)
for shop in df.columns:
    sns.regplot(last4.index + 4, shop, data=last4, ci=None, scatter=False)
plt.ylabel(None)
plt.xticks(list(df.index)+[14, 17], labels=list(df.index)+[10, 13]);
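Note that newer seaborn releases expect regplot()'s x and y as keyword arguments; if the positional call above raises an error, this variant should be equivalent (treat it as a sketch, not verified against every version):

for shop in df.columns:
    sns.regplot(x=last4.index + 4, y=shop, data=last4, ci=None, scatter=False)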
I want to compare average revenue "in offer" vs average revenue "out of offer" for each SKU.
When I merge the two dataframes below on sku, I get multiple rows for each entry because sku is not unique in the second dataframe. For example, every instance of sku = 1 will have two entries, because test_offer contains 2 separate offers for sku 1. However, there can only be one offer live for a SKU at any time, which should satisfy the condition:
test_ga['day'] >= test_offer['start_day'] & test_ga['day'] <= test_offer['end_day']
dataset 1
test_ga = pd.DataFrame( {'day': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 1, 9: 2, 10: 3, 11: 4, 12: 5, 13: 6, 14: 7, 15: 8, 16: 1, 17: 2, 18: 3, 19: 4, 20: 5, 21: 6, 22: 7, 23: 8},
'sku': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 3, 17: 3, 18: 3, 19: 3, 20: 3, 21: 3, 22: 3, 23: 3},
'revenue': {0: 12, 1: 34, 2: 28, 3: 76, 4: 30, 5: 84, 6: 55, 7: 78, 8: 23, 9: 58, 10: 11, 11: 15, 12: 73, 13: 9, 14: 69, 15: 34, 16: 71, 17: 69, 18: 90, 19: 93, 20: 43, 21: 45, 22: 57, 23: 89}} )
dataset 2
test_offer = pd.DataFrame( {'sku': {0: 1, 1: 1, 2: 2},
'offer_number': {0: 5, 1: 6, 2: 7},
'start_day': {0: 2, 1: 6, 2: 4},
'end_day': {0: 4, 1: 7, 2: 8}} )
Expected Output
expected_output = pd.DataFrame( {'day': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 1, 9: 2, 10: 3, 11: 4, 12: 5, 13: 6, 14: 7, 15: 8},
'sku': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2},
'offer': {0: float('nan'), 1: '5', 2: '5', 3: '5', 4: float('nan'), 5: '6', 6: '6', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '7', 12: '7', 13: '7', 14: '7', 15: '7'},
'start_day': {0: float('nan'), 1: '2', 2: '2', 3: '2', 4: float('nan'), 5: '6', 6: '6', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '4', 12: '4', 13: '4', 14: '4', 15: '4'},
'end_day': {0: float('nan'), 1: '4', 2: '4', 3: '4', 4: float('nan'), 5: '7', 6: '7', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '8', 12: '8', 13: '8', 14: '8', 15: '8'},
'revenue': {0: 12, 1: 34, 2: 28, 3: 76, 4: 30, 5: 84, 6: 55, 7: 78, 8: 23, 9: 58, 10: 11, 11: 15, 12: 73, 13: 9, 14: 69, 15: 34}} )
I did actually find a solution based on this SO answer, but it took me a while and the question is not really clear.
I thought it could still be useful to create this question even though I found a solution. Besides, there are probably better ways to achieve this that do not require creating a dummy variable and sorting the dataframe?
If this question is a duplicate, let me know and I will remove it.
One possible solution:
import numpy as np

test_data = pd.merge(test_ga, test_offer, on='sku')
# Flag whether each row falls inside the offer window.
test_data['is_offer'] = np.where((test_data['day'] >= test_data['start_day']) & (test_data['day'] <= test_data['end_day']), 1, 0)
expected_output = test_data.sort_values(['sku','day','is_offer']).groupby(['day', 'sku']).tail(1)
and then clean up the data, adding NaN values for rows that are not in an offer:
expected_output['start_day'] = np.where(expected_output['is_offer'] == 0, np.nan, expected_output['start_day'])
expected_output['end_day'] = np.where(expected_output['is_offer'] == 0, np.nan, expected_output['end_day'])
expected_output['offer_number'] = np.where(expected_output['is_offer'] == 0, np.nan, expected_output['offer_number'])
expected_output
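As a small optional final step (not part of the original solution), the is_offer helper column can be dropped once the NaN clean-up is done, since the expected output above does not contain it:

expected_output = expected_output.drop(columns=['is_offer'])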
Below is my dataframe. I made some transformations to create the category column and dropped the original column it was derived from. Now I need to do a group-by to remove the duplicates, e.g. Love and Fashion can each be rolled up via a groupby sum.
df.columns = array([category, clicks, revenue, date, impressions, size], dtype=object)
df.values=
[[Love 0 0.36823 2013-11-04 380 300x250]
[Love 183 474.81522 2013-11-04 374242 300x250]
[Fashion 0 0.19434 2013-11-04 197 300x250]
[Fashion 9 18.26422 2013-11-04 13363 300x250]]
Here is the index that was created when I created the dataframe:
print df.index
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48])
I assume I want to drop the index and create date and category as a MultiIndex, then do a groupby sum of the metrics. How do I do this with a pandas dataframe?
df.head(15).to_dict()= {'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'}, 'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229}, 'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}}
Python is 2.7 and pandas is 0.7.0 on Ubuntu 12.04. Below is the error I get if I run the following:
import pandas
print pandas.__version__
df = pandas.DataFrame.from_dict(
{
'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'},
'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229},
'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}
}
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0,1]).sum()
Traceback (most recent call last):
File "/home/ubuntu/workspace/devops/reports/groupby_sub.py", line 9, in <module>
df.set_index(['date', 'category'], inplace=True)
File "/usr/lib/pymodules/python2.7/pandas/core/frame.py", line 1927, in set_index
raise Exception('Index has duplicate keys: %s' % duplicates)
Exception: Index has duplicate keys: [('2013-11-04', 'Celebs'), ('2013-11-04', 'Fashion'), ('2013-11-04', 'Health'), ('2013-11-04', 'Love'), ('2013-11-04', 'Movies')]
You can create the index on the existing dataframe. With the subset of data provided, this works for me:
import pandas
df = pandas.DataFrame.from_dict(
{
'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'},
'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229},
'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}
}
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0,1]).sum()
If you're having duplicate index issues with the full dataset, you'll need to clean up the data a bit. Remove the duplicate rows if that's amenable. If the duplicate rows are valid, then what sets them apart from each other? If you can add that to the dataframe and include it in the index, that's ideal. If not, just create a dummy column that defaults to 1, but can be 2 or 3 or ... N in the case of N duplicates -- and then include that field in the index as well.
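A sketch of that dummy column on a modern pandas (the name dup_id is purely illustrative, and this assumes date and category are still ordinary columns), using cumcount() to number the duplicates within each (date, category) pair:

df['dup_id'] = df.groupby(['date', 'category']).cumcount() + 1  # 1..N within each duplicate group
df.set_index(['date', 'category', 'dup_id'], inplace=True)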
Alternatively, I'm pretty sure you can skip the index creation and directly groupby with columns:
df.groupby(by=['date', 'category']).sum()
Again, that works on the subset of data that you posted.
I usually run into this when I try to unstack a MultiIndex and it fails because there are duplicate values.
Here is the simple command that I run to find the problematic items:
df.groupby(level=df.index.names).count()
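Alternatively (a hedged aside, not from the original answer), you can inspect the offending rows before building the index at all, while date and category are still columns:

df[df.duplicated(subset=['date', 'category'], keep=False)]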