value_counts in a weighted manner with pandas - python

I have a pandas DataFrame such as:
import pandas as pd
basket_id = [1, 2, 3, 4, 5]
continents = ['apple', 'apple orange', 'apple orange pear', 'pear apple', 'pear']
df = pd.DataFrame({'basket_id': basket_id, 'continents': continents})
All baskets weigh the same, say 18 kg, and each basket holds an equal amount of each of its fruits: basket 2 has 9 kg of apple and 9 kg of orange.
I want to know how much I have of each fruit. If each basket held only one type of fruit, I could simply apply value_counts and multiply by 18. How can I get the answer in this case?
I expect the following:
fruits = ['apple', 'orange', 'pear']
amounts = [42, 15, 33]
df1 = pd.DataFrame({'fruits' : fruits , 'amounts(kg)' : amounts })
df1
Apples total 42 kg: 18 kg from basket 1, 9 kg from basket 2, 6 kg from basket 3, and 9 kg from basket 4.

You can use Series.str.split followed by Series.explode to get one row per fruit. Then count how many fruits are in each basket using GroupBy.transform, use Series.rdiv to turn that count into each fruit's weight within its basket (18 divided by the count), and finally group by fruit and take the sum.
out = df['continents'].str.split().explode()
amt = out.groupby(level=0).transform('count').rdiv(18).groupby(out).sum()
apple 42.0
orange 15.0
pear 33.0
Name: continents, dtype: float64
To get the exact output shown in the question, name the index with Series.rename_axis and then call Series.reset_index:
amt.rename_axis('fruit').reset_index(name='amounts(kg)')
fruit amounts(kg)
0 apple 42.0
1 orange 15.0
2 pear 33.0
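The chained expression above can be easier to follow when each step gets its own name. The following is a minimal sketch of the same idea, assuming the column is named 'continents' without a trailing space; BASKET_KG, fruit, per_basket, share and totals are illustrative names that do not appear in the original answer.

```python
import pandas as pd

BASKET_KG = 18  # weight of every basket, taken from the question

df = pd.DataFrame({
    'basket_id': [1, 2, 3, 4, 5],
    'continents': ['apple', 'apple orange', 'apple orange pear', 'pear apple', 'pear'],
})

# One row per fruit; the index still identifies the basket it came from
fruit = df['continents'].str.split().explode()

# Number of fruits in the basket each row belongs to
per_basket = fruit.groupby(level=0).transform('count')

# Each fruit gets an equal share of its basket
share = BASKET_KG / per_basket

# Sum the shares per fruit
totals = share.groupby(fruit).sum()
print(totals)
```

Printing fruit, per_basket and share one at a time shows how the basket index is carried through every step.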

So for a basket with N items you want to add 18/N kg of each item? You can use a defaultdict(int), which generates default values for unknown entries by calling int() (which returns 0), and add the amounts to that.
baskets = ['apple', 'apple orange', 'apple orange pear', 'pear apple', 'pear']
from collections import defaultdict
amounts = defaultdict(int)
for basket in baskets:
    items = basket.split()
    for item in items:
        # integer division is exact here because 18 is divisible by 1, 2 and 3
        amounts[item] += 18 // len(items)
print(amounts)
# defaultdict(<class 'int'>, {'apple': 42, 'orange': 15, 'pear': 33})
# if you need a pandas output
import pandas as pd
print(pd.Series(amounts))
# apple 42
# orange 15
# pear 33
# dtype: int64
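The loop above uses floor division (//), which is exact for baskets of 1, 2 or 3 fruits but would drop the remainder for a basket of, say, 4 fruits (18 // 4 is 4, not 4.5). The following is a minimal sketch of a variant with true division, assuming fractional kilograms are acceptable; the function name fruit_totals and its basket_kg parameter are hypothetical and not part of the original answer.

```python
from collections import defaultdict


def fruit_totals(baskets, basket_kg=18):
    """Split each basket's weight equally among its fruits and sum per fruit."""
    amounts = defaultdict(float)
    for basket in baskets:
        items = basket.split()
        for item in items:
            # true division keeps the fractional part
            amounts[item] += basket_kg / len(items)
    return dict(amounts)


totals = fruit_totals(['apple', 'apple orange', 'apple orange pear', 'pear apple', 'pear'])
four_fruits = fruit_totals(['apple orange pear kiwi'])
print(totals)
print(four_fruits)
```

For the baskets in the question this gives the same totals as before, only as floats.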

Related

How to create a structured DataFrame from multiple arrays and lists?

I'm using Python 3.
I have three lists: one with the names of the distributors, one with the products, and one with the classification of the products. I also have two arrays.
Every distributor offers all 15 products.
distributors = ['d1', 'd2', 'd3', 'd4', 'd5']
products = ['apple', 'carrot', 'potato', 'avocado', 'pumkie', 'banana', 'kiwi', 'lettuce', 'tomato', 'pees', 'pear', 'berries', 'strawberries', 'blueberries', 'boxes']
tips = ['fruit', 'vegetables', 'random']
actual_prix = np.random.rand(15, 5)
prix_prox_year = np.random.rand(15,5)
The structure of the arrays is the following: the rows are the products in order and the columns are the distributors in order.
And the output that I need is the following:
Products Distributor Actual Next_year Type
0 apple d1 0.16147847 0.28173206 fruit
1 ... ... ... ... fruit
2 apple d5 ... ... fruit
... ... ... ... ... ...
15 boxes d5 ... ... random
This is just an example; my real arrays have shape (1010, 33).
Any idea?
You can use product from itertools to create all of the combinations; the order of the arguments matters, because it has to match the layout of your data. The arrays need to be tiled so that they repeat for each element in tips, then raveled into a single long array so that the lengths match.
I changed one of your arrays to an increasing count so that it's obvious what is going on.
Sample Data
import numpy as np
distributors = ['d1', 'd2', 'd3', 'd4', 'd5']
products = ['apple', 'carrot', 'potato', 'avocado', 'pumkie', 'banana',
            'kiwi', 'lettuce', 'tomato', 'pees', 'pear', 'berries', 'strawberries',
            'blueberries', 'boxes']
tips = ['fruit', 'vegetables', 'random']
actual_prix = np.arange(15*5).reshape(15,5)
prix_prox_year = np.random.rand(15,5)
from itertools import product
import pandas as pd
df = (pd.DataFrame([*product(products, tips, distributors)],
                   columns=['Products', 'Type', 'Distributor'])
        .assign(Actual=np.tile(actual_prix, len(tips)).ravel(),
                Next_year=np.tile(prix_prox_year, len(tips)).ravel()))
print(df)
Products Type Distributor Actual Next_year
0 apple fruit d1 0 0.391903
1 apple fruit d2 1 0.378865
2 apple fruit d3 2 0.056134
3 apple fruit d4 3 0.623146
4 apple fruit d5 4 0.879184
5 apple vegetables d1 0 0.391903
6 apple vegetables d2 1 0.378865
...
219 boxes vegetables d5 74 0.804884
220 boxes random d1 70 0.900764
221 boxes random d2 71 0.455267
222 boxes random d3 72 0.489814
223 boxes random d4 73 0.054597
224 boxes random d5 74 0.804884
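The tile-and-ravel step only works if the flattened array lines up with the order produced by product. The following is a minimal sketch of a sanity check on a deliberately small example (2 products, 2 types, 3 distributors); the shortened lists and the names actual, expected and aligned are illustrative assumptions, not part of the original answer.

```python
from itertools import product

import numpy as np
import pandas as pd

products = ['apple', 'carrot']
tips = ['fruit', 'vegetables']
distributors = ['d1', 'd2', 'd3']

# rows are products, columns are distributors, as in the question
actual = np.arange(len(products) * len(distributors)).reshape(len(products), len(distributors))

df = pd.DataFrame(list(product(products, tips, distributors)),
                  columns=['Products', 'Type', 'Distributor'])
df['Actual'] = np.tile(actual, len(tips)).ravel()

# Look up the value each row should carry directly from the 2-D array
expected = [actual[products.index(p), distributors.index(d)]
            for p, _, d in product(products, tips, distributors)]

aligned = bool((df['Actual'].to_numpy() == np.array(expected)).all())
print(df)
print(aligned)
```

If the arguments to product were passed in a different order, the same check would be expected to fail, which is why the ordering matters.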
You can use itertools to build the combinations and then put them into a dataframe.
import itertools
store = ["a", "b", "c"]
prods = ['apple', 'banana']
all_combinations = [list(zip(each_permutation, prods)) for each_permutation in itertools.permutations(store, len(prods))]

Use pandas in Python to filter and combine multiple cells into one cell (Excel)

I have a table with multiple columns and repeated data in every column except one (Address).
Last Name First Name Food Address
Brown James Apple 1
Brown Duke Apple 2
William Sam Apple 3
Miller Karen Apple 4
William Barry Orange 5
William Sam Orange 6
Brown James Orange 7
Miller Karen Banana 8
Brown Terry Banana 9
I want to merge all first names sharing the same last name and food into one entry, and keep the first address found when that condition is met.
The result will look like this:
Does anyone know any functions in (pandas) python that allow me to add multiple cells into one? Also, what would be the best approach to solve this?
Thanks!
This should do the trick. There may be a faster way to put it all together, but in the end I pulled out the rows that belong to a repeated (Last Name, Food) group, merged their first names, and put them back together with the non-repeated rows.
I added another repeating row to be sure it worked with more than just two repeating names.
import pandas as pd
d = {'Last Name': ['Brown', 'Brown', 'Brown', 'William', 'Miller', 'William', 'William', 'Brown', 'Miller', 'Brown'],
     'First Name': ['Bill', 'James', 'Duke', 'Sam', 'Karen', 'Barry', 'Sam', 'James', 'Karen', 'Terry'],
     'Food': ['Apple', 'Apple', 'Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Banana', 'Banana'],
     'Address': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
df = pd.DataFrame(d)
grp_df = df.groupby(['Last Name', 'Food'])
# rows whose (Last Name, Food) group has a single member stay as they are
df_nonrepeats = df[grp_df['First Name'].transform('count') == 1]
df_repeats = df[grp_df['First Name'].transform('count') > 1]
def concat_repeats(x):
    # join all first names of the group and keep only its first row
    dff = x.copy()
    temp_list = ' '.join(dff['First Name'].tolist())
    dff['First Name'] = temp_list
    dff = dff.head(1)
    return dff
grp_df = df_repeats.groupby(['Last Name', 'Food'])
df_concats = grp_df.apply(lambda x: concat_repeats(x))
df_final = pd.concat([df_nonrepeats, df_concats[['Last Name', 'First Name', 'Food', 'Address']]]).sort_values('Address').reset_index(drop=True)
print(df_final)
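A shorter route to the same kind of result is a single groupby with one aggregation per column. This is a sketch of an alternative, not the method used in the answer above; it assumes the table from the question and that "first address found" means the first one in row order. The name merged is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Last Name': ['Brown', 'Brown', 'William', 'Miller', 'William', 'William', 'Brown', 'Miller', 'Brown'],
    'First Name': ['James', 'Duke', 'Sam', 'Karen', 'Barry', 'Sam', 'James', 'Karen', 'Terry'],
    'Food': ['Apple', 'Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Banana', 'Banana'],
    'Address': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Join the first names of each (Last Name, Food) group and keep its first address.
# sort=False keeps the groups in order of first appearance.
merged = (df.groupby(['Last Name', 'Food'], as_index=False, sort=False)
            .agg({'First Name': ' '.join, 'Address': 'first'}))
print(merged)
```

Groups with a single member pass through unchanged, so there is no need to split the frame into repeated and non-repeated rows first.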

Pandas find all combinations of rows under a budget

I am trying to figure out a way to determine all possible combinations of rows within a DataFrame that are below a budget, so let's say I have a dataframe like this:
data = [['Bread', 9, 'Food'], ['Shoes', 20, 'Clothes'], ['Shirt', 15, 'Clothes'], ['Milk', 5, 'Drink'], ['Cereal', 8, 'Food'], ['Chips', 10, 'Food'], ['Beer', 15, 'Drink'], ['Popcorn', 3, 'Food'], ['Ice Cream', 6, 'Food'], ['Soda', 4, 'Drink']]
df = pd.DataFrame(data, columns = ['Item', 'Price', 'Type'])
df
Data
Item Price Type
Bread 9 Food
Shoes 20 Clothes
Shirt 15 Clothes
Milk 5 Drink
Cereal 8 Food
Chips 10 Food
Beer 15 Drink
Popcorn 3 Food
Ice Cream 6 Food
Soda 4 Drink
I want to find every combination that I could purchase for under a specific budget, let's say $35 for this example, while only getting one of each type. I'd like to get a new dataframe made up of rows for each combination that works with each item in its own column.
I was trying to do it using itertools.product, but this can combine and add columns, but what I really need to do is combine and add a specific column based on values in another column. I'm a bit stumped now.
Thanks for your help!
Here is a way using the powerset recipe from itertools together with pd.concat:
from itertools import chain, combinations
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
df_groups = pd.concat([df.reindex(l).assign(grp=n) for n, l in
                       enumerate(powerset(df.index))
                       if (df.loc[l, 'Price'].sum() <= 35)])
This outputs a single dataframe with the groups of products that meet the $35 condition:
Item Price Type grp
0 Bread 9 Food 1
1 Shoes 20 Clothes 2
2 Shirt 15 Clothes 3
3 Milk 5 Drink 4
4 Cereal 8 Food 5
.. ... ... ... ...
3 Milk 5 Drink 752
4 Cereal 8 Food 752
7 Popcorn 3 Food 752
8 Ice Cream 6 Food 752
9 Soda 4 Drink 752
How many combinations meet the $35 budget?
df_groups['grp'].nunique()
Output:
258
Details:
There are a couple of tricks used here. First, we use the index of the dataframe to create groups of rows (items) with powerset. Next, we use enumerate to number each group, and assign to add a new column holding that group number.
Modify to capture no more than one of each type:
df_groups = pd.concat([df.reindex(l).assign(grp=n) for n, l in
                       enumerate(powerset(df.index))
                       if ((df.loc[l, 'Price'].sum() <= 35) &
                           (df.loc[l, 'Type'].value_counts()==1).all())])
How many groups?
df_groups['grp'].nunique()
62
Get exactly one for each Type:
df_groups = pd.concat([df.reindex(l).assign(grp=n) for n, l in
                       enumerate(powerset(df.index))
                       if ((df.loc[l, 'Price'].sum() <= 35) &
                           (df.loc[l, 'Type'].value_counts()==1).all()&
                           (len(df.loc[l, 'Type']) == 3))])
How many groups?
df_groups['grp'].nunique()
21
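For the "exactly one of each type" case, the powerset visits all 1024 subsets of the ten rows even though only 30 of them pick one item per type. The following is a sketch of an alternative that enumerates just those candidates with itertools.product and puts each type in its own column, as the question asked; it is a swapped-in technique, not the answer's method, and BUDGET, by_type and combos are illustrative names.

```python
from itertools import product

import pandas as pd

data = [['Bread', 9, 'Food'], ['Shoes', 20, 'Clothes'], ['Shirt', 15, 'Clothes'],
        ['Milk', 5, 'Drink'], ['Cereal', 8, 'Food'], ['Chips', 10, 'Food'],
        ['Beer', 15, 'Drink'], ['Popcorn', 3, 'Food'], ['Ice Cream', 6, 'Food'],
        ['Soda', 4, 'Drink']]
df = pd.DataFrame(data, columns=['Item', 'Price', 'Type'])

BUDGET = 35

# One list of (item, price) pairs per type
by_type = {t: list(zip(g['Item'], g['Price'])) for t, g in df.groupby('Type')}
types = list(by_type)

rows = []
# product picks exactly one (item, price) pair from every type
for combo in product(*by_type.values()):
    total = sum(price for _, price in combo)
    if total <= BUDGET:
        row = {t: item for t, (item, _) in zip(types, combo)}
        row['Total'] = total
        rows.append(row)

combos = pd.DataFrame(rows)
print(combos)
print(len(combos))
```

On this data the number of rows in combos agrees with the 21 groups reported above.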

Creating a new column with last 2 values after a str.split operation

I came across this extremely well explained similar question (Get last "column" after .str.split() operation on column in pandas DataFrame), and used some of the codes found. However, it's not the output that I would like.
raw_data = {
'category': ['sweet beverage, cola,sugared', 'healthy,salty snacks', 'juice,beverage,sweet', 'fruit juice,beverage', 'appetizer,salty crackers'],
'product_name': ['coca-cola', 'salted pistachios', 'fruit juice', 'lemon tea', 'roasted peanuts']}
df = pd.DataFrame(raw_data)
The objective is to extract the categories from each row and use only the last 2 of them to create a new column. I have this code, which works and gives me the categories of interest as a new column.
df['my_col'] = df.category.apply(lambda s:s.split(',')[-2:])
output
my_col
[cola,sugared]
[healthy,salty snacks]
[beverage,sweet]
...
However, the result appears as a list. How can I get it as a plain string instead? Thanks all!
I believe you need str.split, then select the last two elements of each list, and finally str.join:
df['my_col'] = df.category.str.split(',').str[-2:].str.join(',')
print (df)
category product_name my_col
0 sweet beverage, cola,sugared coca-cola cola,sugared
1 healthy,salty snacks salted pistachios healthy,salty snacks
2 juice,beverage,sweet fruit juice beverage,sweet
3 fruit juice,beverage lemon tea fruit juice,beverage
4 appetizer,salty crackers roasted peanuts appetizer,salty crackers
EDIT:
In my opinion the pandas str text functions are preferable to apply with pure Python string functions, because they also work with NaN and None values.
import numpy as np
raw_data = {
'category': [np.nan, 'healthy,salty snacks'],
'product_name': ['coca-cola', 'salted pistachios']}
df = pd.DataFrame(raw_data)
df['my_col'] = df.category.str.split(',').str[-2:].str.join(',')
print (df)
category product_name my_col
0 NaN coca-cola NaN
1 healthy,salty snacks salted pistachios healthy,salty snacks
By contrast, the apply version fails on the NaN row:
df['my_col'] = df.category.apply(lambda s: ','.join(s.split(',')[-2:]))
AttributeError: 'float' object has no attribute 'split'
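One detail of the sample data: the first category string has a space after a comma ('sweet beverage, cola,sugared'), so splitting on ',' alone leaves a leading space on ' cola'. The following is a minimal sketch, assuming that stray whitespace around commas should be dropped; it normalises the separators with Series.str.replace before the split, and the name clean is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'category': ['sweet beverage, cola,sugared', np.nan, 'fruit juice,beverage'],
})

# Remove any whitespace around the commas; NaN values pass through untouched
clean = df['category'].str.replace(r'\s*,\s*', ',', regex=True)

# Same split / slice / join chain as above, applied to the cleaned strings
df['my_col'] = clean.str.split(',').str[-2:].str.join(',')
print(df)
```

Spaces inside a category, such as 'fruit juice', are kept because only whitespace next to a comma is removed.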
You can also use join in the lambda to the result of split:
df['my_col'] = df.category.apply(lambda s: ','.join(s.split(',')[-2:]))
df
Result:
category product_name my_col
0 sweet beverage, cola,sugared coca-cola cola,sugared
1 healthy,salty snacks salted pistachios healthy,salty snacks
2 juice,beverage,sweet fruit juice beverage,sweet
3 fruit juice,beverage lemon tea fruit juice,beverage
4 appetizer,salty crackers roasted peanuts appetizer,salty crackers

Using python and pandas to create combinations instead of permutations

I have a dataset structurally similar to the one created below. Imagine each user brought a bag with the corresponding fruit. I want to count all pairwise combinations (not permutations) of fruit options, and use them to generate a probability that a user owns the bag after pulling two fruits out of it. There is an assumption that no user ever brings two of the same fruit.
import pandas as pd
df = pd.DataFrame({'user':['Matt', 'Matt', 'Matt', 'Matt', 'Tom', 'Tom', 'Tom', 'Tom', 'Nick', 'Nick', 'Nick', 'Nick', 'Nick'], 'fruit': ['Plum', 'Apple', 'Orange', 'Pear', 'Grape', 'Apple', 'Orange', 'Banana', 'Orange', 'Grape', 'Apple', 'Banana', 'Tomato']})[['user', 'fruit']]
print(df)
My thought was to merge the dataframe back onto itself on user, and generate counts based on unique pairs of fruit_x and fruit_y.
df_merged = df.merge(df, how='inner', on='user')
print(df_merged)
Unfortunately the merge yields two types of unwanted results. Instances where a fruit has been merged back onto itself are easy to fix.
df_fix1 = df_merged.query('fruit_x != fruit_y').copy()
gb_pair_user = df_fix1.groupby(['user', 'fruit_x', 'fruit_y'])
df_fix1['pair_user_count'] = gb_pair_user['user'].transform('count')
gb_pair = df_fix1.groupby(['fruit_x', 'fruit_y'])
df_fix1['pair_count'] = gb_pair['user'].transform('count')
df_fix1['probability'] = df_fix1['pair_user_count'] / df_fix1['pair_count'] *1.0
print(df_fix1[['fruit_x', 'fruit_y', 'probability', 'user']])
The second type is where I'm stuck. There is no meaningful difference between Apple+Orange and Orange+Apple, so I'd like to remove one of those rows. If there is a way to get proper combinations, I'd be very interested in that, otherwise if anyone can suggest a hack to eliminate the duplicated information that would be great too.
You can take advantage of combinations from itertools to create the unique pairs of fruits for each user.
from itertools import combinations
def func(group):
    return pd.DataFrame(list(combinations(group.fruit, 2)), columns=['fruit_x', 'fruit_y'])
df.groupby('user').apply(func).reset_index(level=1, drop=True)
fruit_x fruit_y
user
Matt Plum Apple
Matt Plum Orange
Matt Plum Pear
Matt Apple Orange
Matt Apple Pear
Matt Orange Pear
Nick Orange Grape
Nick Orange Apple
Nick Orange Banana
Nick Orange Tomato
Nick Grape Apple
Nick Grape Banana
Nick Grape Tomato
Nick Apple Banana
Nick Apple Tomato
Nick Banana Tomato
Tom Grape Apple
Tom Grape Orange
Tom Grape Banana
Tom Apple Orange
Tom Apple Banana
Tom Orange Banana
You can then calculate the probability according to your program logic.
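One possible way to finish the calculation is sketched below. It assumes the probability of a user owning the bag is 1 divided by the number of users whose bag contains that pair, which is how the question's own code computes it. The fruits are sorted inside each bag before pairing so that Apple+Orange and Orange+Apple land in the same row across users; the names rows and pairs are illustrative.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'user': ['Matt', 'Matt', 'Matt', 'Matt', 'Tom', 'Tom', 'Tom', 'Tom',
             'Nick', 'Nick', 'Nick', 'Nick', 'Nick'],
    'fruit': ['Plum', 'Apple', 'Orange', 'Pear', 'Grape', 'Apple', 'Orange', 'Banana',
              'Orange', 'Grape', 'Apple', 'Banana', 'Tomato'],
})

# Sorting each bag gives every unordered pair one canonical (fruit_x, fruit_y) form
rows = [
    (user, fruit_x, fruit_y)
    for user, fruits in df.groupby('user')['fruit']
    for fruit_x, fruit_y in combinations(sorted(fruits), 2)
]
pairs = pd.DataFrame(rows, columns=['user', 'fruit_x', 'fruit_y'])

# Number of users whose bag contains the pair
pairs['pair_count'] = pairs.groupby(['fruit_x', 'fruit_y'])['user'].transform('count')

# Each of those users is equally likely to own the bag
pairs['probability'] = 1 / pairs['pair_count']
print(pairs)
```

Without the sort, Matt's bag would yield (Apple, Orange) while Nick's would yield (Orange, Apple), and the two rows would be counted as different pairs.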
