I am trying to figure out a way to determine all possible combinations of rows within a DataFrame that stay below a budget. Let's say I have a DataFrame like this:
data = [['Bread', 9, 'Food'], ['Shoes', 20, 'Clothes'], ['Shirt', 15, 'Clothes'], ['Milk', 5, 'Drink'], ['Cereal', 8, 'Food'], ['Chips', 10, 'Food'], ['Beer', 15, 'Drink'], ['Popcorn', 3, 'Food'], ['Ice Cream', 6, 'Food'], ['Soda', 4, 'Drink']]
df = pd.DataFrame(data, columns = ['Item', 'Price', 'Type'])
df
Item Price Type
Bread 9 Food
Shoes 20 Clothes
Shirt 15 Clothes
Milk 5 Drink
Cereal 8 Food
Chips 10 Food
Beer 15 Drink
Popcorn 3 Food
Ice Cream 6 Food
Soda 4 Drink
I want to find every combination that I could purchase for under a specific budget, let's say $35 for this example, while only getting one of each type. I'd like to get a new dataframe made up of rows for each combination that works with each item in its own column.
I was trying to do it using itertools.product, which can combine and add columns, but what I really need is to combine rows and sum a specific column based on values in another column. I'm a bit stumped now.
Thanks for your help!
Here is a way using the powerset recipe from itertools together with pd.concat:
from itertools import chain, combinations

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

df_groups = pd.concat([df.reindex(l).assign(grp=n)
                       for n, l in enumerate(powerset(df.index))
                       if df.loc[l, 'Price'].sum() <= 35])
This outputs a single dataframe with groups of products that meet the $35 condition:
Item Price Type grp
0 Bread 9 Food 1
1 Shoes 20 Clothes 2
2 Shirt 15 Clothes 3
3 Milk 5 Drink 4
4 Cereal 8 Food 5
.. ... ... ... ...
3 Milk 5 Drink 752
4 Cereal 8 Food 752
7 Popcorn 3 Food 752
8 Ice Cream 6 Food 752
9 Soda 4 Drink 752
How many combinations meet the $35 budget?
df_groups['grp'].nunique()
Output:
258
Details:
There are a couple of tricks/methods used here. First, we use the index of the dataframe to create groups of rows (items) via powerset. Next, we use enumerate to identify each group, and with assign we create a new column in the dataframe holding that group number from enumerate.
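For a quick illustration of what enumerate over the powerset produces, using the powerset defined above on a three-element index:

list(enumerate(powerset([0, 1, 2])))
# [(0, ()), (1, (0,)), (2, (1,)), (3, (2,)),
#  (4, (0, 1)), (5, (0, 2)), (6, (1, 2)), (7, (0, 1, 2))]

Group 0 is the empty selection, which contributes no rows to the concatenation.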
Modify to capture no more than one of each type:
df_groups = pd.concat([df.reindex(l).assign(grp=n)
                       for n, l in enumerate(powerset(df.index))
                       if (df.loc[l, 'Price'].sum() <= 35) &
                          (df.loc[l, 'Type'].value_counts() == 1).all()])
How many groups?
df_groups['grp'].nunique()
62
Get exactly one for each Type:
df_groups = pd.concat([df.reindex(l).assign(grp=n)
                       for n, l in enumerate(powerset(df.index))
                       if (df.loc[l, 'Price'].sum() <= 35) &
                          (df.loc[l, 'Type'].value_counts() == 1).all() &
                          (len(df.loc[l, 'Type']) == 3)])
How many groups?
df_groups['grp'].nunique()
21
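If you prefer the layout from the question (one row per combination, each item in its own column), one possible sketch using cumcount and pivot, assuming df_groups from the last snippet:

wide = (df_groups.assign(pos=df_groups.groupby('grp').cumcount())
                 .pivot(index='grp', columns='pos', values='Item'))

Each row of wide is one valid combination; pos is just the item's position within its group, so the item columns come out as 0, 1, 2, ...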
I have a problem with an Excel file where I need to classify data in some columns and rows. I need to rearrange the merged cells so that each one becomes a row, with the values from the next column placed beside it, like in these pictures:
Input:
Output for Dairy:
Summary:
First we take the Dairy row, then we go to the second column and get the data in front of Dairy. Then we go to the next column: in front of 'Milk to Mr.1' we get 'Butter to Mrs.1' and 'Butter to Mrs.2', and so on.
After that we want to export it into an Excel file like in the Output picture.
I have written code which gets the first-column data and finds all the data in front of it, but I need to change it in order to get the data row by row, like in the Output picture:
import pandas
import openpyxl
import xlwt
from xlwt import Workbook

df = pandas.read_excel('excel.xlsx')
result_first_level = []

for i, item in enumerate(df[df.columns[0]].values, 2):
    if pandas.isna(item):
        result_first_level[-1]['index'] = i
    else:
        result_first_level.append(dict(name=item, index=i, levels_name=[]))

for level in df.columns[1:]:
    move_index = 0
    for i, obj in enumerate(result_first_level):
        if i == 0:
            for item in df[level].values[0:obj['index'] - 1]:
                if pandas.isna(item):
                    move_index += 1
                    continue
                else:
                    obj['levels_name'].append(item)
                    move_index += 1
        else:
            for item in df[level].values[move_index:obj['index'] - 1]:
                if pandas.isna(item):
                    move_index += 1
                    continue
                else:
                    obj['levels_name'].append(item)
                    move_index += 1

# Workbook is created
wb = Workbook()
# add_sheet is used to create a sheet.
sheet1 = wb.add_sheet('Sheet 1')
style = xlwt.easyxf('font: bold 1')

move_index = 0
for item in result_first_level:
    for member in item['levels_name']:
        sheet1.write(move_index, 0, item['name'], style)
        sheet1.write(move_index, 1, member)
        move_index += 1

wb.save('test.xls')
Download the input Excel file from here.
Thanks for helping!
First, forward-fill your data so blank cells take the last valid value, then create an ordered categorical with pd.CategoricalDtype to sort the product column. Then you just have to iterate over the columns pairwise and rename them so they can be concatenated. The last step is to sort your rows by product value.
import pandas as pd
# Prepare your dataframe
df = pd.read_excel('input.xlsx').dropna(how='all')
df.update(df.iloc[:, :-1].ffill())
df = df.drop_duplicates()
# Get keys to sort data in the final output
cats = pd.CategoricalDtype(df.T.melt()['value'].dropna().unique(), ordered=True)
# Group pairwise values
data = []
for cols in zip(df.columns, df.columns[1:]):
    col_mapping = dict(zip(cols, ['product', 'subproduct']))
    data.append(df[list(cols)].rename(columns=col_mapping))

# Merge all data
out = pd.concat(data).drop_duplicates().dropna() \
        .astype(cats).sort_values('product').reset_index(drop=True)
Output:
>>> cats
CategoricalDtype(categories=['Dairy', 'Milk to Mr.1', 'Butter to Mrs.1',
'Butter to Mrs.2', 'Cheese to Miss 2 ', 'Cheese to Mr.2',
'Milk to Miss.1', 'Milk to Mr.5', 'yoghurt to Mr.3',
'Milk to Mr.6', 'Fruits', 'Apples to Mr.6',
'Limes to Miss 5', 'Oranges to Mr.7', 'Plumbs to Miss 5',
'apple for mr 2', 'Foods & Drinks', 'Chips to Mr1',
'Jam to Mr 2.', 'Coca to Mr 5', 'Cookies to Mr1.',
'Coca to Mr 7', 'Coca to Mr 6', 'Juice to Miss 1',
'Jam to Mr 3.', 'Ice cream to Miss 3.', 'Honey to Mr 5',
'Cake to Mrs. 2', 'Honey to Miss 2',
'Chewing gum to Miss 7.'], ordered=True)
>>> out
product subproduct
0 Dairy Milk to Mr.1
1 Dairy Cheese to Mr.2
2 Milk to Mr.1 Butter to Mrs.1
3 Milk to Mr.1 Butter to Mrs.2
4 Butter to Mrs.2 Cheese to Miss 2
5 Cheese to Mr.2 Milk to Miss.1
6 Cheese to Mr.2 yoghurt to Mr.3
7 Milk to Miss.1 Milk to Mr.5
8 yoghurt to Mr.3 Milk to Mr.6
9 Fruits Apples to Mr.6
10 Fruits Oranges to Mr.7
11 Apples to Mr.6 Limes to Miss 5
12 Oranges to Mr.7 Plumbs to Miss 5
13 Plumbs to Miss 5 apple for mr 2
14 Foods & Drinks Chips to Mr1
15 Foods & Drinks Juice to Miss 1
16 Foods & Drinks Cake to Mrs. 2
17 Chips to Mr1 Jam to Mr 2.
18 Chips to Mr1 Cookies to Mr1.
19 Jam to Mr 2. Coca to Mr 5
20 Cookies to Mr1. Coca to Mr 6
21 Cookies to Mr1. Coca to Mr 7
22 Juice to Miss 1 Honey to Mr 5
23 Juice to Miss 1 Jam to Mr 3.
24 Jam to Mr 3. Ice cream to Miss 3.
25 Cake to Mrs. 2 Chewing gum to Miss 7.
26 Cake to Mrs. 2 Honey to Miss 2
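To write the result back to an Excel file, as the question asks, pandas' to_excel should do it (a sketch; the output filename is just an example, and the openpyxl engine is assumed to be installed):

out.to_excel('output.xlsx', index=False)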
I have the following data frame that has been obtained by applying df.groupby(['category', 'unit_quantity']).count()
category         unit_quantity  Count
banana           1EA            5
eggs             100G           22
                 100ML          1
full cream milk  100G           5
                 100ML          1
                 1L             38
Let's call this latter dataframe grouped. I want to find a way to regroup it using the unit_quantity and Count columns to get:
category         unit_quantity  Count  Most Frequent unit_quantity
banana           1EA            5      1EA
eggs             100G           22     100G
                 100ML          1      100G
full cream milk  100G           5      1L
                 100ML          1      1L
                 1L             38     1L
Now, I tried to apply grouped.groupby(level=1).max() which gives me
unit_quantity  Count
100G           22
100ML          1
1EA            5
1L             38
Now, because the indices of the latter and grouped do not coincide, I cannot join it using .merge. Does someone know how to solve this issue?
Thanks in advance
Starting from your DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['banana', 'eggs', 'eggs', 'full cream milk', 'full cream milk', 'full cream milk'],
... 'unit_quantity': ['1EA', '100G', '100ML', '100G', '100ML', '1L'],
... 'Count': [5, 22, 1, 5, 1, 38],},
... index = [0, 1, 2, 3, 4, 5])
>>> df
category unit_quantity Count
0 banana 1EA 5
1 eggs 100G 22
2 eggs 100ML 1
3 full cream milk 100G 5
4 full cream milk 100ML 1
5 full cream milk 1L 38
You can use the transform method applied to the max of the Count column in order to keep your category and unit_quantity values:
>>> idx = df.groupby(['unit_quantity'])['Count'].transform(max) == df['Count']
>>> df[idx]
category unit_quantity Count
0 banana 1EA 5
1 eggs 100G 22
2 eggs 100ML 1
4 full cream milk 100ML 1
5 full cream milk 1L 38
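If you also want the 'Most Frequent unit_quantity' column from the desired output (the unit_quantity with the highest Count within each category), one possible sketch using idxmax and map, with the column names taken from the question:

most_freq = df.loc[df.groupby('category')['Count'].idxmax()].set_index('category')['unit_quantity']
df['Most Frequent unit_quantity'] = df['category'].map(most_freq)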
I have a pandas dataframe such as
basket_id = [1,2,3,4,5]
continents = ['apple', 'apple orange', 'apple orange pear', 'pear apple', 'pear']
df = pd.DataFrame({'basket_id': basket_id, 'continents': continents})
Baskets are equal in weight, say 18 kg, and each basket has an equal amount of each of its fruits: basket 2 has 9 kg of apple and 9 kg of orange.
I want to know how much I have of each fruit. If each basket had only one type of fruit I could simply apply value_counts and multiply by 18, but how can I get my answer here?
I expect the following:
fruits = ['apple', 'orange', 'pear']
amounts = [42, 15, 33]
df1 = pd.DataFrame({'fruits' : fruits , 'amounts(kg)' : amounts })
df1
Apples are 42 kg: 18 kg from basket 1, 9 kg from basket 2, 6 kg from basket 3, and 9 kg from basket 4.
You can use Series.str.split then Series.explode. Then count how many fruits are in each basket using GroupBy.transform, use Series.rdiv to divide 18 by that count (giving each fruit's weight within its basket), then group by fruit and take the sum.
out = df['continents'].str.split().explode()
amt = out.groupby(level=0).transform('count').rdiv(18).groupby(out).sum()
apple 42.0
orange 15.0
pear 33.0
Name: continents, dtype: float64
To get the exact output mentioned in the question, use Series.reset_index then DataFrame.rename:
amt.reset_index(name='amounts(kg)').rename(columns={'index':'fruit'})
fruit amounts(kg)
0 apple 42.0
1 orange 15.0
2 pear 33.0
So for each of the N items in a basket you want to add 18/N kg of that item? You can use a defaultdict(int), which generates default values for unknown entries by calling int() (which gives 0), and add the amounts to that.
baskets = ['apple', 'apple orange', 'apple orange pear', 'pear apple', 'pear']

from collections import defaultdict

amounts = defaultdict(int)
for basket in baskets:
    items = basket.split()
    for item in items:
        amounts[item] += 18 // len(items)

print(amounts)
# defaultdict(<class 'int'>, {'apple': 42, 'orange': 15, 'pear': 33})

# if you need a pandas output
import pandas as pd
print(pd.Series(amounts))
# apple     42
# orange    15
# pear      33
# dtype: int64
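Note that the integer division 18 // len(items) only happens to be exact here, since every basket has 1, 2, or 3 items and all of those divide 18. For basket sizes that don't divide the weight evenly, a float-based variant of the same loop (a sketch):

amounts = defaultdict(float)
for basket in baskets:
    items = basket.split()
    for item in items:
        amounts[item] += 18 / len(items)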
I'd like to generate more than 100 rows randomly while keeping the link between observations.
Below my example :
There are 4 variables: Country, Category, Product, and Price. Category and Product need to be linked together.
import random as rd
import pandas as pd

Country = []
Category = []
Product = []
Price = []

for i in range(1000):
    Country.append(rd.choice(['England', 'Germany', 'France', 'USA', 'China', 'Japan']))
    Category.append(rd.choice(['Electronics', 'home appliances', 'Computer', 'Food', 'Bedding']))
    Product.append(rd.choice(['Iphone 6S', 'Samsung Fridge', 'PC ASUS', 'Cheese', 'Bed']))
    Price.append(rd.randint(10, 10000))

data = pd.DataFrame(data={'Country': Country, 'Category': Category,
                          'Product': Product, 'Price': Price})
When I executed the code above, the Category observations weren't matched with their corresponding Product observations. For example, you could get a row with Electronics (Category) and Cheese (Product), which obviously makes no sense.
Any ideas would be appreciated
Thank you in advance
You can use Series.map to create the new column from a dictionary built with zip of the two lists, after generating the DataFrame without the Product column.
Also, appending to lists in a loop is not necessary; it is faster to use numpy.random.choice and numpy.random.randint:
import numpy as np
import pandas as pd

N = 10000
L0 = ['England','Germany','France','USA','China','Japan']
L1 = ['Electronics','home appliances','Computer','Food','Bedding']
L2 = ['Iphone 6S','Samsung Fridge','PC ASUS','Cheese','Bed']
d = dict(zip(L1, L2))
print (d)
{'Electronics': 'Iphone 6S', 'home appliances': 'Samsung Fridge',
'Computer': 'PC ASUS', 'Food': 'Cheese', 'Bedding': 'Bed'}
data = pd.DataFrame(data={'Country': np.random.choice(L0, size=N),
                          'Category': np.random.choice(L1, size=N),
                          'Price': np.random.randint(10, size=N)})
data['Product'] = data['Category'].map(d)
print (data)
Country Category Price Product
0 Germany Food 1 Cheese
1 England Food 6 Cheese
2 Japan Bedding 3 Bed
3 France Electronics 1 Iphone 6S
4 Japan home appliances 8 Samsung Fridge
... ... ... ...
9995 England Electronics 3 Iphone 6S
9996 China Electronics 1 Iphone 6S
9997 Germany Bedding 0 Bed
9998 USA Electronics 3 Iphone 6S
9999 Germany home appliances 6 Samsung Fridge
[10000 rows x 4 columns]
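If each Category should allow several possible Products rather than exactly one, a hedged sketch of the same idea is to map each category to a list and pick randomly per row (the per-category product lists here are made up for illustration):

# hypothetical multi-product mapping, not from the question
d_multi = {'Electronics': ['Iphone 6S', 'Samsung TV'],
           'home appliances': ['Samsung Fridge'],
           'Computer': ['PC ASUS'],
           'Food': ['Cheese'],
           'Bedding': ['Bed']}
data['Product'] = [np.random.choice(d_multi[c]) for c in data['Category']]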
I have a dataset
Item Type market_share
Office Supplies 10
Baby Food 20
Vegetables 10
Meat 30
Personal Care 10
Household 20
I want to club together all the rows except Baby Food so that my dataset will look like:
Item Type market_share
Others 80
Baby Food 20
How can I do that? Basically, club all the other rows, sum them, and label them as Others.
You can use:
df.groupby(df['Item Type'].eq('Baby Food').map({True:'Baby Food',False:'Others'})).sum()
market_share
Item Type
Baby Food 20
Others 80
Create an array by condition with np.where, or a Series with Series.map (unmapped values become NaN and are filled with 'Others'), then aggregate with sum:
s = np.where(df['Item Type'] == 'Baby Food', 'Baby Food', 'Others')
print (s)
['Others' 'Baby Food' 'Others' 'Others' 'Others' 'Others']
s = df['Item Type'].map({'Baby Food':'Baby Food'}).fillna('Others')
print (s)
0 Others
1 Baby Food
2 Others
3 Others
4 Others
5 Others
Name: Item Type, dtype: object
df = df.groupby(s)['market_share'].sum().rename_axis('Item Type').reset_index()
print (df)
Item Type market_share
0 Baby Food 20
1 Others 80
Use np.where -
df['market_share_2'] = np.where(df['Item Type'].values=='Baby Food', 'Baby Food', 'Others')
Output
Item Type market_share market_share_2
0 Office Supplies 10 Others
1 Baby Food 20 Baby Food
2 Vegetables 10 Others
3 Meat 30 Others
4 Personal_Care 10 Others
5 Household 20 Others
Then use value_counts() -
df['market_share_2'].value_counts()
Others 5
Baby Food 1
Name: market_share_2, dtype: int64
TLDR;
pd.Series(np.where(df['Item Type'].values=='Baby Food', 'Baby Food', 'Others')).value_counts()
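Note that value_counts here counts rows per group; if you need the summed market_share instead (as in the desired output), group by the helper column and sum (a sketch):

df.groupby('market_share_2')['market_share'].sum()
# Baby Food    20
# Others       80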
You can also filter rows with the inequality operator != and the equality operator ==; note the condition must be on the Item Type column, not market_share:

df[df['Item Type'] != 'Baby Food']['market_share'].sum()   # 80
df[df['Item Type'] == 'Baby Food']['market_share'].sum()   # 20
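To assemble the two-row result frame from the question out of those sums (a sketch building on the lines above):

totals = pd.DataFrame({'Item Type': ['Baby Food', 'Others'],
                       'market_share': [df[df['Item Type'] == 'Baby Food']['market_share'].sum(),
                                        df[df['Item Type'] != 'Baby Food']['market_share'].sum()]})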