Group values based on columns and conditions in pandas

Group values based on columns and conditions in pandas - python

I want to group pandas dataframe column based on a condition that if the values are with in a range of +20.
Below is the dataframe
{'Name': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F'},
'ID': {0: 100, 1: 23, 2: 19, 3: 42, 4: 11, 5: 78},
'Left': {0: 70, 1: 70, 2: 70, 3: 70, 4: 66, 5: 66},
'Top': {0: 10, 1: 26, 2: 26, 3: 35, 4: 60, 5: 71}}
Here I want to group columns Left and Top.
This is what I did:
df.groupby(['Top'],as_index=False).agg(lambda x: list(x))
This is the result I got :
{'Top': {0: 10, 1: 26, 2: 35, 3: 60, 4: 71},
'Name': {0: ['A'], 1: ['B', 'C'], 2: ['D'], 3: ['E'], 4: ['F']},
'ID': {0: [100], 1: [23, 19], 2: [42], 3: [11], 4: [78]},
'Left': {0: [70], 1: [70, 70], 2: [43], 3: [66], 4: [66]}}
Desired output:
{'Top': {0: [10, 26], 2: 35, 3: [60,71]},
'Name': {0: ['A', 'B', 'C'], 2: ['D'], 3: ['E', 'F']},
'ID': {0: [100, 23, 19], 2: [42], 3: [11, 78]},
'Left': {0: [70, 50, 87], 2: [43], 3: [66, 99]}}
NOTE:
An important thing to consider is that Top values 10 and 26 are in the range of 20, it forms a group. 35 should not be added to the group even though its difference between 26 and 35 are in the range of 20 because 10 and 20 are already in a group and the difference between 10(the least value in the group) and 35 is not in the range of 20.
Is there any any alternate way to solve this?
EDIT:
I have a different use-case for which the top values increase and when it moves to a new page the top value changes and starts increasing again. This goes on for different inputs. And finally I want to group by Input File Name, Page Number and group. How can I group these?
{'Input File Name': {0: 268441,
1: 268441,
2: 268441,
3: 268441,
4: 268441,
5: 268441,
6: 268441,
7: 268441,
8: 268441,
9: 268441,
10: 268441,
11: 268441,
12: 268441,
13: 268441,
14: 268441,
15: 268441,
16: 268441,
17: 268441,
18: 268441,
19: 268441,
20: 268441,
21: 268441,
22: 268441,
23: 268441,
24: 268441,
25: 268441,
26: 268441,
27: 268441,
28: 268441,
29: 268441,
30: 268441,
31: 268441,
32: 268441,
33: 268441,
34: 268441,
35: 268441,
36: 268441,
37: 268441,
38: 268441,
39: 268441},
'Page Number': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 1,
7: 1,
8: 1,
9: 1,
10: 1,
11: 1,
12: 1,
13: 1,
14: 1,
15: 1,
16: 1,
17: 1,
18: 1,
19: 1,
20: 2,
21: 2,
22: 2,
23: 2,
24: 2,
25: 2,
26: 2,
27: 2,
28: 2,
29: 2,
30: 2,
31: 2,
32: 2,
33: 2,
34: 2,
35: 2,
36: 2,
37: 2,
38: 2,
39: 2},
'Content': {0: '3708 Forestview Road',
1: 'AvailableForLease&Sale',
2: '1,700± SFMedicalOffice',
3: '3708ForestviewRoad',
4: 'Suite107',
5: 'Raleigh,NC27612',
6: 'BuildingDescription',
7: '22,278± SFClassAOfficeBuilding',
8: 'OnlyOneSuiteLeft toLeaseand/orPurchase',
9: '(1)1,700± SFShell',
10: 'FlexibleLeaseTerms',
11: '2Floorsw/Elevator&Stairsto2',
12: 'Level',
13: 'nd',
14: 'ClassAFinishes',
15: 'On-SitePropertyManagement',
16: 'LargeGlass Windows',
17: '5:1Parking',
18: 'Formoreinformation,contact:',
19: 'OtherTenants: PivotPhysicalTherapy,TheLundy',
20: 'LeasingDetails',
21: 'SpaceDescription',
22: 'LeaseRate',
23: 'CompetitiveNNN+$5.50TICAM',
24: 'Tenant',
25: 'Suite107:1,700± SF',
26: 'Janitorial&Electric',
27: 'Responsibilities',
28: 'ShellSpacew/TIAllowance&Architecturals',
29: 'ClassABuilding',
30: 'SalePrice',
31: '$374,000or$220PSF',
32: 'BeautifulDouble-DoorEntry',
33: '1,700',
34: '± SF',
35: 'Size',
36: 'LargeGlassWindows',
37: 'ColdDarkShellw/TIAllowance',
38: '5:1Parking',
39: 'Upfit'},
'Top': {0: 6,
1: 6,
2: 49,
3: 103,
4: 103,
5: 103,
6: 590,
7: 637,
8: 656,
9: 676,
10: 695,
11: 716,
12: 716,
13: 717,
14: 736,
15: 755,
16: 775,
17: 794,
18: 813,
19: 835,
20: 111,
21: 138,
22: 142,
23: 142,
24: 169,
25: 174,
26: 179,
27: 190,
28: 195,
29: 216,
30: 217,
31: 217,
32: 238,
33: 247,
34: 247,
35: 248,
36: 259,
37: 274,
38: 282,
39: 285}}

You can write a function to group the Top columns first and then use groupby on that column:
import pandas as pd
df = pd.DataFrame({'Name': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F'},
'ID': {0: 100, 1: 23, 2: 19, 3: 42, 4: 11, 5: 78},
'Left': {0: 70, 1: 70, 2: 70, 3: 70, 4: 66, 5: 66},
'Top': {0: 10, 1: 26, 2: 26, 3: 35, 4: 60, 5: 71}})
def group(l, group_range):
groups = []
current_group = []
i = 0
group_count = 1
while i < len(l):
a = l[i]
if len(current_group) == 0:
if i == len(l) - 1:
break
current_group_start = a
if a <= current_group_start + group_range:
current_group.append(group_count)
if a < current_group_start + group_range:
i += 1
else:
groups.extend(current_group)
current_group = []
group_count += 1
groups.extend(current_group)
return groups
#group(df['Top'],20) -> [1, 1, 1, 2, 3, 3]
df['group'] = group(df['Top'],20)
df.groupby(['group'],as_index=False).agg(list)
Output:
group ID Left Name Top
0 1 [100, 23, 19] [70, 70, 70] [A, B, C] [10, 26, 26]
1 2 [42] [70] [D] [35]
2 3 [11, 78] [66, 66] [E, F] [60, 71]

Related

python title disappears when trying to align it to the left in seaborn

Here is the dataframe I'm working with in python. I'm including the dataframe here with this line of code:
print(mtcars.to_dict())
{'Unnamed: 0': {0: 'Mazda RX4', 1: 'Mazda RX4 Wag', 2: 'Datsun 710', 3: 'Hornet 4 Drive', 4: 'Hornet Sportabout', 5: 'Valiant', 6: 'Duster 360', 7: 'Merc 240D', 8: 'Merc 230', 9: 'Merc 280', 10: 'Merc 280C', 11: 'Merc 450SE', 12: 'Merc 450SL', 13: 'Merc 450SLC', 14: 'Cadillac Fleetwood', 15: 'Lincoln Continental', 16: 'Chrysler Imperial', 17: 'Fiat 128', 18: 'Honda Civic', 19: 'Toyota Corolla', 20: 'Toyota Corona', 21: 'Dodge Challenger', 22: 'AMC Javelin', 23: 'Camaro Z28', 24: 'Pontiac Firebird', 25: 'Fiat X1-9', 26: 'Porsche 914-2', 27: 'Lotus Europa', 28: 'Ford Pantera L', 29: 'Ferrari Dino', 30: 'Maserati Bora', 31: 'Volvo 142E'}, 'mpg': {0: 21.0, 1: 21.0, 2: 22.8, 3: 21.4, 4: 18.7, 5: 18.1, 6: 14.3, 7: 24.4, 8: 22.8, 9: 19.2, 10: 17.8, 11: 16.4, 12: 17.3, 13: 15.2, 14: 10.4, 15: 10.4, 16: 14.7, 17: 32.4, 18: 30.4, 19: 33.9, 20: 21.5, 21: 15.5, 22: 15.2, 23: 13.3, 24: 19.2, 25: 27.3, 26: 26.0, 27: 30.4, 28: 15.8, 29: 19.7, 30: 15.0, 31: 21.4}, 'cyl': {0: 6, 1: 6, 2: 4, 3: 6, 4: 8, 5: 6, 6: 8, 7: 4, 8: 4, 9: 6, 10: 6, 11: 8, 12: 8, 13: 8, 14: 8, 15: 8, 16: 8, 17: 4, 18: 4, 19: 4, 20: 4, 21: 8, 22: 8, 23: 8, 24: 8, 25: 4, 26: 4, 27: 4, 28: 8, 29: 6, 30: 8, 31: 4}, 'disp': {0: 160.0, 1: 160.0, 2: 108.0, 3: 258.0, 4: 360.0, 5: 225.0, 6: 360.0, 7: 146.7, 8: 140.8, 9: 167.6, 10: 167.6, 11: 275.8, 12: 275.8, 13: 275.8, 14: 472.0, 15: 460.0, 16: 440.0, 17: 78.7, 18: 75.7, 19: 71.1, 20: 120.1, 21: 318.0, 22: 304.0, 23: 350.0, 24: 400.0, 25: 79.0, 26: 120.3, 27: 95.1, 28: 351.0, 29: 145.0, 30: 301.0, 31: 121.0}, 'hp': {0: 110, 1: 110, 2: 93, 3: 110, 4: 175, 5: 105, 6: 245, 7: 62, 8: 95, 9: 123, 10: 123, 11: 180, 12: 180, 13: 180, 14: 205, 15: 215, 16: 230, 17: 66, 18: 52, 19: 65, 20: 97, 21: 150, 22: 150, 23: 245, 24: 175, 25: 66, 26: 91, 27: 113, 28: 264, 29: 175, 30: 335, 31: 109}, 'drat': {0: 3.9, 1: 3.9, 2: 3.85, 3: 3.08, 4: 3.15, 5: 2.76, 6: 3.21, 7: 3.69, 8: 3.92, 9: 3.92, 10: 3.92, 11: 3.07, 12: 3.07, 13: 3.07, 14: 2.93, 15: 3.0, 16: 3.23, 17: 4.08, 18: 4.93, 19: 4.22, 20: 3.7, 21: 2.76, 22: 3.15, 23: 3.73, 24: 3.08, 25: 4.08, 26: 4.43, 27: 3.77, 28: 4.22, 29: 3.62, 30: 3.54, 31: 4.11}, 'wt': {0: 2.62, 1: 2.875, 2: 2.32, 3: 3.215, 4: 3.44, 5: 3.46, 6: 3.57, 7: 3.19, 8: 3.15, 9: 3.44, 10: 3.44, 11: 4.07, 12: 3.73, 13: 3.78, 14: 5.25, 15: 5.424, 16: 5.345, 17: 2.2, 18: 1.615, 19: 1.835, 20: 2.465, 21: 3.52, 22: 3.435, 23: 3.84, 24: 3.845, 25: 1.935, 26: 2.14, 27: 1.513, 28: 3.17, 29: 2.77, 30: 3.57, 31: 2.78}, 'qsec': {0: 16.46, 1: 17.02, 2: 18.61, 3: 19.44, 4: 17.02, 5: 20.22, 6: 15.84, 7: 20.0, 8: 22.9, 9: 18.3, 10: 18.9, 11: 17.4, 12: 17.6, 13: 18.0, 14: 17.98, 15: 17.82, 16: 17.42, 17: 19.47, 18: 18.52, 19: 19.9, 20: 20.01, 21: 16.87, 22: 17.3, 23: 15.41, 24: 17.05, 25: 18.9, 26: 16.7, 27: 16.9, 28: 14.5, 29: 15.5, 30: 14.6, 31: 18.6}, 'vs': {0: 0, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0, 7: 1, 8: 1, 9: 1, 10: 1, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 1, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 0, 27: 1, 28: 0, 29: 0, 30: 0, 31: 1}, 'am': {0: 1, 1: 1, 2: 1, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 0, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1}, 'gear': {0: 4, 1: 4, 2: 4, 3: 3, 4: 3, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 4, 11: 3, 12: 3, 13: 3, 14: 3, 15: 3, 16: 3, 17: 4, 18: 4, 19: 4, 20: 3, 21: 3, 22: 3, 23: 3, 24: 3, 25: 4, 26: 5, 27: 5, 28: 5, 29: 5, 30: 5, 31: 4}, 'carb': {0: 4, 1: 4, 2: 1, 3: 1, 4: 2, 5: 1, 6: 4, 7: 2, 8: 2, 9: 4, 10: 4, 11: 3, 12: 3, 13: 3, 14: 4, 15: 4, 16: 4, 17: 1, 18: 2, 19: 1, 20: 1, 21: 2, 22: 2, 23: 4, 24: 2, 25: 1, 26: 2, 27: 2, 28: 4, 29: 6, 30: 8, 31: 2}}
This SO post was helpful in learning how to print the python dataframe like R does with the dput() function.
Now I import seaborn and create a histogram.
import seaborn as seaborn
seaborn.histplot(data=mtcars, x="mpg", bins = 30)
plt.suptitle("Mtcars", loc = 'left')
plt.title("histogram", loc = 'left')
plt.show()
This doesn't work as the title disappears.
So I clear out whatever is happening with the graphs and try again.
plt.figure().clear()
plt.close()
plt.cla()
plt.clf()
seaborn.histplot(data=mtcars, x="mpg", bins = 30)
plt.suptitle("Mtcars", horizontalalignment = 'left')
plt.title("histogram", loc = 'left')
plt.show()
But this doesn't work either. This time, the title is there but the alignment is wrong.
I'd like to put both the title and the subtitle on the left side.

Python trying to create a graph but it's blank

Here is the dataframe that I'm working with in python.
{'Unnamed: 0': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29, 29: 30, 30: 31, 31: 32}, 'car': {0: 'Mazda RX4', 1: 'Mazda RX4 Wag', 2: 'Datsun 710', 3: 'Hornet 4 Drive', 4: 'Hornet Sportabout', 5: 'Valiant', 6: 'Duster 360', 7: 'Merc 240D', 8: 'Merc 230', 9: 'Merc 280', 10: 'Merc 280C', 11: 'Merc 450SE', 12: 'Merc 450SL', 13: 'Merc 450SLC', 14: 'Cadillac Fleetwood', 15: 'Lincoln Continental', 16: 'Chrysler Imperial', 17: 'Fiat 128', 18: 'Honda Civic', 19: 'Toyota Corolla', 20: 'Toyota Corona', 21: 'Dodge Challenger', 22: 'AMC Javelin', 23: 'Camaro Z28', 24: 'Pontiac Firebird', 25: 'Fiat X1-9', 26: 'Porsche 914-2', 27: 'Lotus Europa', 28: 'Ford Pantera L', 29: 'Ferrari Dino', 30: 'Maserati Bora', 31: 'Volvo 142E'}, 'mpg': {0: 21.0, 1: 21.0, 2: 22.8, 3: 21.4, 4: 18.7, 5: 18.1, 6: 14.3, 7: 24.4, 8: 22.8, 9: 19.2, 10: 17.8, 11: 16.4, 12: 17.3, 13: 15.2, 14: 10.4, 15: 10.4, 16: 14.7, 17: 32.4, 18: 30.4, 19: 33.9, 20: 21.5, 21: 15.5, 22: 15.2, 23: 13.3, 24: 19.2, 25: 27.3, 26: 26.0, 27: 30.4, 28: 15.8, 29: 19.7, 30: 15.0, 31: 21.4}, 'cyl': {0: 6, 1: 6, 2: 4, 3: 6, 4: 8, 5: 6, 6: 8, 7: 4, 8: 4, 9: 6, 10: 6, 11: 8, 12: 8, 13: 8, 14: 8, 15: 8, 16: 8, 17: 4, 18: 4, 19: 4, 20: 4, 21: 8, 22: 8, 23: 8, 24: 8, 25: 4, 26: 4, 27: 4, 28: 8, 29: 6, 30: 8, 31: 4}, 'disp': {0: 160.0, 1: 160.0, 2: 108.0, 3: 258.0, 4: 360.0, 5: 225.0, 6: 360.0, 7: 146.7, 8: 140.8, 9: 167.6, 10: 167.6, 11: 275.8, 12: 275.8, 13: 275.8, 14: 472.0, 15: 460.0, 16: 440.0, 17: 78.7, 18: 75.7, 19: 71.1, 20: 120.1, 21: 318.0, 22: 304.0, 23: 350.0, 24: 400.0, 25: 79.0, 26: 120.3, 27: 95.1, 28: 351.0, 29: 145.0, 30: 301.0, 31: 121.0}, 'hp': {0: 110, 1: 110, 2: 93, 3: 110, 4: 175, 5: 105, 6: 245, 7: 62, 8: 95, 9: 123, 10: 123, 11: 180, 12: 180, 13: 180, 14: 205, 15: 215, 16: 230, 17: 66, 18: 52, 19: 65, 20: 97, 21: 150, 22: 150, 23: 245, 24: 175, 25: 66, 26: 91, 27: 113, 28: 264, 29: 175, 30: 335, 31: 109}, 'drat': {0: 3.9, 1: 3.9, 2: 3.85, 3: 3.08, 4: 3.15, 5: 2.76, 6: 3.21, 7: 3.69, 8: 3.92, 9: 3.92, 10: 3.92, 11: 3.07, 12: 3.07, 13: 3.07, 14: 2.93, 15: 3.0, 16: 3.23, 17: 4.08, 18: 4.93, 19: 4.22, 20: 3.7, 21: 2.76, 22: 3.15, 23: 3.73, 24: 3.08, 25: 4.08, 26: 4.43, 27: 3.77, 28: 4.22, 29: 3.62, 30: 3.54, 31: 4.11}, 'wt': {0: 2.62, 1: 2.875, 2: 2.32, 3: 3.215, 4: 3.44, 5: 3.46, 6: 3.57, 7: 3.19, 8: 3.15, 9: 3.44, 10: 3.44, 11: 4.07, 12: 3.73, 13: 3.78, 14: 5.25, 15: 5.424, 16: 5.345, 17: 2.2, 18: 1.615, 19: 1.835, 20: 2.465, 21: 3.52, 22: 3.435, 23: 3.84, 24: 3.845, 25: 1.935, 26: 2.14, 27: 1.513, 28: 3.17, 29: 2.77, 30: 3.57, 31: 2.78}, 'qsec': {0: 16.46, 1: 17.02, 2: 18.61, 3: 19.44, 4: 17.02, 5: 20.22, 6: 15.84, 7: 20.0, 8: 22.9, 9: 18.3, 10: 18.9, 11: 17.4, 12: 17.6, 13: 18.0, 14: 17.98, 15: 17.82, 16: 17.42, 17: 19.47, 18: 18.52, 19: 19.9, 20: 20.01, 21: 16.87, 22: 17.3, 23: 15.41, 24: 17.05, 25: 18.9, 26: 16.7, 27: 16.9, 28: 14.5, 29: 15.5, 30: 14.6, 31: 18.6}, 'vs': {0: 0, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0, 7: 1, 8: 1, 9: 1, 10: 1, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 1, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 0, 27: 1, 28: 0, 29: 0, 30: 0, 31: 1}, 'am': {0: 1, 1: 1, 2: 1, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 0, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1}, 'gear': {0: 4, 1: 4, 2: 4, 3: 3, 4: 3, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 4, 11: 3, 12: 3, 13: 3, 14: 3, 15: 3, 16: 3, 17: 4, 18: 4, 19: 4, 20: 3, 21: 3, 22: 3, 23: 3, 24: 3, 25: 4, 26: 5, 27: 5, 28: 5, 29: 5, 30: 5, 31: 4}, 'carb': {0: 4, 1: 4, 2: 1, 3: 1, 4: 2, 5: 1, 6: 4, 7: 2, 8: 2, 9: 4, 10: 4, 11: 3, 12: 3, 13: 3, 14: 4, 15: 4, 16: 4, 17: 1, 18: 2, 19: 1, 20: 1, 21: 2, 22: 2, 23: 4, 24: 2, 25: 1, 26: 2, 27: 2, 28: 4, 29: 6, 30: 8, 31: 2}}
Here is the code that I'm using. The subplot part I got off a datacamp module.
fig, ax = plt.subplot()
plt.show()
But when I go to plot the mtcars dataset, one variable against the other, I get a blank canvas. Why is that? I don't see how the code is different than what I am looking at on DataCamp.
ax.plot(mtcars['cyl'], mtcars['mpg'])
plt.show()
The answer from below is helpful and gets me closer to a solution but it is giving me lines instead of a scatterplot?
fig, ax = plt.subplot()
plt.show()

import matplotlib.pyplot as plt
plt.plot(df['cyl'], df['mpg'])
plt.show()
or:
ax = plt.subplot(2, 1, 1)
ax.plot(df['cyl'], df['mpg'])
plt.show()

Strange result from pymc3 for small change in the variable

I have below model for which I seek estimation of parameters using pymc3
import pandas as pd
import pymc3 as pm
import arviz as arviz
myData = pd.DataFrame.from_dict({
'Unnamed: 0': {
0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10,
10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20,
20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29, 29: 30,
30: 31, 31: 32, 32: 33, 33: 34, 34: 35, 35: 36, 36: 37, 37: 38},
'y': {
0: 0.0079235409492941, 1: 0.0086530073429249, 2: 0.0297400780486734, 3: 0.0196358416326437, 4: 0.0023902064076204, 5: 0.0258055591736283, 6: 0.17394835142698, 7: 0.156463554455613, 8: 0.329388185725557, 9: 0.0076443508881763,
10: 0.0162081480398152, 11: 0.0, 12: 0.0015759139941696, 13: 0.420025972703085, 14: 0.0001226236519444, 15: 0.133061480234834, 16: 0.565454216154227, 17: 0.0002819734812997, 18: 0.000559715156383, 19: 0.0270686389659072,
20: 0.918300537689865, 21: 7.8262468302e-06, 22: 0.0073241434191945, 23: 0.0, 24: 0.0, 25: 0.0, 26: 0.0, 27: 0.0, 28: 0.0, 29: 0.0,
30: 0.174071274611405, 31: 0.0432109713717948, 32: 0.0544400838264943, 33: 0.0, 34: 0.0907049925221286, 35: 0.616680102647887, 36: 0.0, 37: 0.0},
'x': {
0: 23.8187587698947, 1: 15.9991138359515, 2: 33.6495930512881, 3: 28.555818797764, 4: -52.2967967248258, 5: -91.3835208788233, 6: -73.9830692708321, 7: -5.16901145289629, 8: 29.8363012310241, 9: 10.6820057903939,
10: 19.4868517164395, 11: 15.4499668436458, 12: -17.0441644773509, 13: 10.7025053739577, 14: -8.6382953428539, 15: -32.8892974839165, 16: -15.8671863161348, 17: -11.237248036145, 18: -7.37978020066205, 19: -3.33500586334862,
20: -4.02629933182873, 21: -20.2413384726948, 22: -54.9094885578775, 23: -48.041459120976, 24: -52.3125732905322, 25: -35.6269065970458, 26: -62.0296155423529, 27: -49.0825017152659, 28: -73.0574478287598, 29: -50.9409090127938,
30: -63.4650928035253, 31: -55.1263264283842, 32: -52.2841103768755, 33: -61.2275334149805, 34: -74.2175990067417, 35: -68.2961107804698, 36: -76.6834643609286, 37: -70.16769103228}
})
with pm.Model() as myModel :
beta0 = pm.Normal('intercept', 0, 1)
beta1 = pm.Normal('x', 0, 1)
mu = beta0 + beta1 * myData['x'].values
pm.Bernoulli('obs', p = pm.invlogit(mu), observed = myData['y'].values)
with myModel :
calc = pm.sample(50000, tune = 10000, step = pm.Metropolis(), random_seed = 1000)
arviz.summary(calc, round_to = 10)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
intercept -2.537501 0.599667 -3.707061 -1.450243 0.004375 0.003118 18893.344191 22631.772985 1.000070
x 0.033750 0.024314 -0.007871 0.081619 0.000181 0.000133 18550.620475 20113.739639 1.000194
Now I changed above model to this,
mu = beta0 + beta1 * myData['x'].values * 0
With this change I get below result,
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
intercept -2.690874 0.546570 -3.698465 -1.643091 0.003611 0.002565 22980.471424 24806.935727 1.000036
x -0.013861 1.003612 -1.916176 1.826709 0.006874 0.005175 21336.662537 23299.680306 1.000084
I wonder if above estimate is correct. Should not I expect very small estimate for the coefficient beta1? I see hardly any change for this estimate except just change in sign.
Any pointer is highly appreciated.

"hardly any change for this estimate"
Seems like you are ignoring the sd, which has a strong change and is behaving as expected. That is, the first version yields 0.034 ± 0.024 (weakly positive); whereas the second correctly reverts to the prior with -0.014 ± 1.00.
Looking at the input data, none of this seems surprising:

Is there a formulaic approach to find the frequency of the sum of combinations?

I have 5 strawberries, 2 lemons, and a banana. For each possible combination of these (including selecting 0), there is a total number of objects. I ultimately want a list of the frequencies at which these sums appear.
[1 strawberry, 0 lemons, 0 bananas] = 1 objects
[2 strawberries, 0 lemons, 1 banana] = 3 objects
[0 strawberries, 1 lemon, 0 bananas] = 1 objects
[2 strawberries, 1 lemon, 0 bananas] = 3 objects
[3 strawberries, 0 lemons, 0 bananas] = 3 objects
For just the above selection of 5 combinations, "1" has a frequency of 2 and "3" has a frequency of 3.
Obviously there are far more possible combinations, each changing the frequency result. Is there a formulaic way to approach the problem to find the frequencies for an entire set of combinations?
Currently, I've set up a brute-force function in Python.
special_cards = {
'A':7, 'B':1, 'C':1, 'D':1, 'E':1, 'F':1, 'G':1, 'H':1, 'I':1, 'J':1, 'K':1, 'L':1,
'M':1, 'N':1, 'O':1, 'P':1, 'Q':1, 'R':1, 'S':1, 'T':1, 'U':1, 'V':1, 'W':1, 'X':1,
'Y':1, 'Z':1, 'AA':1, 'AB':1, 'AC':1, 'AD':1, 'AE':1, 'AF':1, 'AG':1, 'AH':1, 'AI':1, 'AJ':1,
'AK':1, 'AL':1, 'AM':1, 'AN':1, 'AO':1, 'AP':1, 'AQ':1, 'AR':1, 'AS':1, 'AT':1, 'AU':1, 'AV':1,
'AW':1, 'AX':1, 'AY':1
}
def _calc_dis_specials(special_cards):
"""Calculate the total combinations when special cards are factored in"""
# Create an iterator for special card combinations.
special_paths = _gen_dis_special_list(special_cards)
freq = {}
path_count = 0
for o_path in special_paths: # Loop through the iterator
path_count += 1 # Keep track of how many combinations we've evaluated thus far.
try: # I've been told I can use a collections.counter() object instead of try/except.
path_sum = sum(o_path) # Sum the path (counting objects)
new_count = freq[path_sum] + 1 # Try to increment the count for our sum.
freq.update({path_sum: new_count})
except KeyError:
freq.update({path_sum: 1})
print(f"{path_count:,}\n{freq}")
print(f"{path_count:,}\n{freq}")
# Do things with results yadda yadda
def _gen_dis_special_list(special_cards):
"""Generates an iterator for all combinations for special cards"""
product_args = []
for value in special_cards.values(): # A card's "value" is the maximum number that can be in a deck.
product_args.append(range(value+1)) # Populates product_args with lists of each card's possible count.
result = itertools.product(*product_args)
return result
However, for large numbers of object pools (50+) the factorial just gets out of hand. Billions upon billions of combinations. I need a formulaic approach.
Looking at some output, I notice a couple of things:
1
{0: 1}
2
{0: 1, 1: 1}
4
{0: 1, 1: 2, 2: 1}
8
{0: 1, 1: 3, 2: 3, 3: 1}
16
{0: 1, 1: 4, 2: 6, 3: 4, 4: 1}
32
{0: 1, 1: 5, 2: 10, 3: 10, 4: 5, 5: 1}
64
{0: 1, 1: 6, 2: 15, 3: 20, 4: 15, 5: 6, 6: 1}
128
{0: 1, 1: 7, 2: 21, 3: 35, 4: 35, 5: 21, 6: 7, 7: 1}
256
{0: 1, 1: 8, 2: 28, 3: 56, 4: 70, 5: 56, 6: 28, 7: 8, 8: 1}
512
{0: 1, 1: 9, 2: 36, 3: 84, 4: 126, 5: 126, 6: 84, 7: 36, 8: 9, 9: 1}
1,024
{0: 1, 1: 10, 2: 45, 3: 120, 4: 210, 5: 252, 6: 210, 7: 120, 8: 45, 9: 10, 10: 1}
2,048
{0: 1, 1: 11, 2: 55, 3: 165, 4: 330, 5: 462, 6: 462, 7: 330, 8: 165, 9: 55, 10: 11, 11: 1}
4,096
{0: 1, 1: 12, 2: 66, 3: 220, 4: 495, 5: 792, 6: 924, 7: 792, 8: 495, 9: 220, 10: 66, 11: 12, 12: 1}
8,192
{0: 1, 1: 13, 2: 78, 3: 286, 4: 715, 5: 1287, 6: 1716, 7: 1716, 8: 1287, 9: 715, 10: 286, 11: 78, 12: 13, 13: 1}
16,384
{0: 1, 1: 14, 2: 91, 3: 364, 4: 1001, 5: 2002, 6: 3003, 7: 3432, 8: 3003, 9: 2002, 10: 1001, 11: 364, 12: 91, 13: 14, 14: 1}
32,768
{0: 1, 1: 15, 2: 105, 3: 455, 4: 1365, 5: 3003, 6: 5005, 7: 6435, 8: 6435, 9: 5005, 10: 3003, 11: 1365, 12: 455, 13: 105, 14: 15, 15: 1}
65,536
{0: 1, 1: 16, 2: 120, 3: 560, 4: 1820, 5: 4368, 6: 8008, 7: 11440, 8: 12870, 9: 11440, 10: 8008, 11: 4368, 12: 1820, 13: 560, 14: 120, 15: 16, 16: 1}
131,072
{0: 1, 1: 17, 2: 136, 3: 680, 4: 2380, 5: 6188, 6: 12376, 7: 19448, 8: 24310, 9: 24310, 10: 19448, 11: 12376, 12: 6188, 13: 2380, 14: 680, 15: 136, 16: 17, 17: 1}
262,144
{0: 1, 1: 18, 2: 153, 3: 816, 4: 3060, 5: 8568, 6: 18564, 7: 31824, 8: 43758, 9: 48620, 10: 43758, 11: 31824, 12: 18564, 13: 8568, 14: 3060, 15: 816, 16: 153, 17: 18, 18: 1}
524,288
{0: 1, 1: 19, 2: 171, 3: 969, 4: 3876, 5: 11628, 6: 27132, 7: 50388, 8: 75582, 9: 92378, 10: 92378, 11: 75582, 12: 50388, 13: 27132, 14: 11628, 15: 3876, 16: 969, 17: 171, 18: 19, 19: 1}
1,048,576
{0: 1, 1: 20, 2: 190, 3: 1140, 4: 4845, 5: 15504, 6: 38760, 7: 77520, 8: 125970, 9: 167960, 10: 184756, 11: 167960, 12: 125970, 13: 77520, 14: 38760, 15: 15504, 16: 4845, 17: 1140, 18: 190, 19: 20, 20: 1}
2,097,152
{0: 1, 1: 21, 2: 210, 3: 1330, 4: 5985, 5: 20349, 6: 54264, 7: 116280, 8: 203490, 9: 293930, 10: 352716, 11: 352716, 12: 293930, 13: 203490, 14: 116280, 15: 54264, 16: 20349, 17: 5985, 18: 1330, 19: 210, 20: 21, 21: 1}
4,194,304
{0: 1, 1: 22, 2: 231, 3: 1540, 4: 7315, 5: 26334, 6: 74613, 7: 170544, 8: 319770, 9: 497420, 10: 646646, 11: 705432, 12: 646646, 13: 497420, 14: 319770, 15: 170544, 16: 74613, 17: 26334, 18: 7315, 19: 1540, 20: 231, 21: 22, 22: 1}
8,388,608
{0: 1, 1: 23, 2: 253, 3: 1771, 4: 8855, 5: 33649, 6: 100947, 7: 245157, 8: 490314, 9: 817190, 10: 1144066, 11: 1352078, 12: 1352078, 13: 1144066, 14: 817190, 15: 490314, 16: 245157, 17: 100947, 18: 33649, 19: 8855, 20: 1771, 21: 253, 22: 23, 23: 1}
16,777,216
{0: 1, 1: 24, 2: 276, 3: 2024, 4: 10626, 5: 42504, 6: 134596, 7: 346104, 8: 735471, 9: 1307504, 10: 1961256, 11: 2496144, 12: 2704156, 13: 2496144, 14: 1961256, 15: 1307504, 16: 735471, 17: 346104, 18: 134596, 19: 42504, 20: 10626, 21: 2024, 22: 276, 23: 24, 24: 1}
33,554,432
{0: 1, 1: 25, 2: 300, 3: 2300, 4: 12650, 5: 53130, 6: 177100, 7: 480700, 8: 1081575, 9: 2042975, 10: 3268760, 11: 4457400, 12: 5200300, 13: 5200300, 14: 4457400, 15: 3268760, 16: 2042975, 17: 1081575, 18: 480700, 19: 177100, 20: 53130, 21: 12650, 22: 2300, 23: 300, 24: 25, 25: 1}
67,108,864
{0: 1, 1: 26, 2: 325, 3: 2600, 4: 14950, 5: 65780, 6: 230230, 7: 657800, 8: 1562275, 9: 3124550, 10: 5311735, 11: 7726160, 12: 9657700, 13: 10400600, 14: 9657700, 15: 7726160, 16: 5311735, 17: 3124550, 18: 1562275, 19: 657800, 20: 230230, 21: 65780, 22: 14950, 23: 2600, 24: 325, 25: 26, 26: 1}
134,217,728
{0: 1, 1: 27, 2: 351, 3: 2925, 4: 17550, 5: 80730, 6: 296010, 7: 888030, 8: 2220075, 9: 4686825, 10: 8436285, 11: 13037895, 12: 17383860, 13: 20058300, 14: 20058300, 15: 17383860, 16: 13037895, 17: 8436285, 18: 4686825, 19: 2220075, 20: 888030, 21: 296010, 22: 80730, 23: 17550, 24: 2925, 25: 351, 26: 27, 27: 1}
268,435,456
{0: 1, 1: 28, 2: 378, 3: 3276, 4: 20475, 5: 98280, 6: 376740, 7: 1184040, 8: 3108105, 9: 6906900, 10: 13123110, 11: 21474180, 12: 30421755, 13: 37442160, 14: 40116600, 15: 37442160, 16: 30421755, 17: 21474180, 18: 13123110, 19: 6906900, 20: 3108105, 21: 1184040, 22: 376740, 23: 98280, 24: 20475, 25: 3276, 26: 378, 27: 28, 28: 1}
536,870,912
{0: 1, 1: 29, 2: 406, 3: 3654, 4: 23751, 5: 118755, 6: 475020, 7: 1560780, 8: 4292145, 9: 10015005, 10: 20030010, 11: 34597290, 12: 51895935, 13: 67863915, 14: 77558760, 15: 77558760, 16: 67863915, 17: 51895935, 18: 34597290, 19: 20030010, 20: 10015005, 21: 4292145, 22: 1560780, 23: 475020, 24: 118755, 25: 23751, 26: 3654, 27: 406, 28: 29, 29: 1}
1,073,741,824
{0: 1, 1: 30, 2: 435, 3: 4060, 4: 27405, 5: 142506, 6: 593775, 7: 2035800, 8: 5852925, 9: 14307150, 10: 30045015, 11: 54627300, 12: 86493225, 13: 119759850, 14: 145422675, 15: 155117520, 16: 145422675, 17: 119759850, 18: 86493225, 19: 54627300, 20: 30045015, 21: 14307150, 22: 5852925, 23: 2035800, 24: 593775, 25: 142506, 26: 27405, 27: 4060, 28: 435, 29: 30, 30: 1}
Note that I'm only printing when a new key (sum) is found.
I notice that
a new sum is found only on powers of 2 and
the results are symmetrical.
This hints to me that there's a formulaic approach that could work.
Any ideas on how to proceed?

Good news; there is a formula for this, and I'll explain the path there in case there is any confusion.
Let's look at your initial example: 5 strawberries (S), 2 lemons (L), and a banana (B). Let's lay out all of the fruits:
S S S S S L L B
We can actually rephrase the question now, because the number of times that 3, for example, will be the total number is the number of different ways you can pick 3 of the fruits from this list.
In statistics, the choose function (a.k.a nCk), answers just this question: how many ways are there to select a group of k items from a group of n items. This is computed as n!/((n-k)!*k!), where "!" is the factorial, a number multiplied by all numbers less than itself. As such the frequency of 3s would be (the number of fruits) "choose" (the total in question), or 8 choose 3. This is 8!/(5!*3!) = 56.

How to do group by on a multiindex in pandas?

Below is my dataframe. I made some transformations to create the category column and dropped the original column it was derived from. Now I need to do a group-by to remove the dups e.g. Love and Fashion can be rolled up via a groupby sum.
df.colunms = array([category, clicks, revenue, date, impressions, size], dtype=object)
df.values=
[[Love 0 0.36823 2013-11-04 380 300x250]
[Love 183 474.81522 2013-11-04 374242 300x250]
[Fashion 0 0.19434 2013-11-04 197 300x250]
[Fashion 9 18.26422 2013-11-04 13363 300x250]]
Here is the index that is created when I created the dataframe
print df.index
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48])
I assume I want to drop the index, and create date, and category as a multiindex then do a groupby sum of the metrics. How do I do this in pandas dataframe?
df.head(15).to_dict()= {'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'}, 'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229}, 'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}}
Python is 2.7 and pandas is 0.7.0 on ubuntu 12.04. Below is the error I get if I run the below
import pandas
print pandas.__version__
df = pandas.DataFrame.from_dict(
{
'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'},
'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229},
'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}
}
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0,1]).sum()
Traceback (most recent call last):
File "/home/ubuntu/workspace/devops/reports/groupby_sub.py", line 9, in <module>
df.set_index(['date', 'category'], inplace=True)
File "/usr/lib/pymodules/python2.7/pandas/core/frame.py", line 1927, in set_index
raise Exception('Index has duplicate keys: %s' % duplicates)
Exception: Index has duplicate keys: [('2013-11-04', 'Celebs'), ('2013-11-04', 'Fashion'), ('2013-11-04', 'Health'), ('2013-11-04', 'Love'), ('2013-11-04', 'Movies')]

You can create the index on the existing dataframe. With the subset of data provided, this works for me:
import pandas
df = pandas.DataFrame.from_dict(
{
'category': {0: 'Love', 1: 'Love', 2: 'Fashion', 3: 'Fashion', 4: 'Hair', 5: 'Movies', 6: 'Movies', 7: 'Health', 8: 'Health', 9: 'Celebs', 10: 'Celebs', 11: 'Travel', 12: 'Weightloss', 13: 'Diet', 14: 'Bags'},
'impressions': {0: 380, 1: 374242, 2: 197, 3: 13363, 4: 4, 5: 189, 6: 60632, 7: 269, 8: 40189, 9: 138, 10: 66590, 11: 2227, 12: 22668, 13: 21707, 14: 229},
'date': {0: '2013-11-04', 1: '2013-11-04', 2: '2013-11-04', 3: '2013-11-04', 4: '2013-11-04', 5: '2013-11-04', 6: '2013-11-04', 7: '2013-11-04', 8: '2013-11-04', 9: '2013-11-04', 10: '2013-11-04', 11: '2013-11-04', 12: '2013-11-04', 13: '2013-11-04', 14: '2013-11-04'}, 'cpc_cpm_revenue': {0: 0.36823, 1: 474.81522000000001, 2: 0.19434000000000001, 3: 18.264220000000002, 4: 0.00080000000000000004, 5: 0.23613000000000001, 6: 81.391139999999993, 7: 0.27171000000000001, 8: 51.258200000000002, 9: 0.11536, 10: 83.966859999999997, 11: 3.43248, 12: 31.695889999999999, 13: 28.459320000000002, 14: 0.43524000000000002}, 'clicks': {0: 0, 1: 183, 2: 0, 3: 9, 4: 0, 5: 1, 6: 20, 7: 0, 8: 21, 9: 0, 10: 32, 11: 1, 12: 12, 13: 9, 14: 2}, 'size': {0: '300x250', 1: '300x250', 2: '300x250', 3: '300x250', 4: '300x250', 5: '300x250', 6: '300x250', 7: '300x250', 8: '300x250', 9: '300x250', 10: '300x250', 11: '300x250', 12: '300x250', 13: '300x250', 14: '300x250'}
}
)
df.set_index(['date', 'category'], inplace=True)
df.groupby(level=[0,1]).sum()
If you're having duplicate index issues with the full dataset, you'll need to clean up the data a bit. Remove the duplicate rows if that's amenable. If the duplicate rows are valid, then what sets them apart from each other? If you can add that to the dataframe and include it in the index, that's ideal. If not, just create a dummy column that defaults to 1, but can be 2 or 3 or ... N in the case of N duplicates -- and then include that field in the index as well.
Alternatively, I'm pretty sure you can skip the index creation and directly groupby with columns:
df.groupby(by=['date', 'category']).sum()
Again, that works on the subset of data that you posted.

I usually try to do it when I try to unstack a multi-index and it fails because there are duplicate values.
Here is the simple command that I run the find the problematic items:
df.groupby(level=df.index.names).count()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Group values based on columns and conditions in pandas - python

Related

python title disappears when trying to align it to the left in seaborn

Python trying to create a graph but it's blank

Strange result from pymc3 for small change in the variable

Is there a formulaic approach to find the frequency of the sum of combinations?

How to do group by on a multiindex in pandas?

Categories

Resources