How often is an item in a purchase? - python

I would like to calculate how often an item appears in a shopping cart.
Each purchase is identified by a buyerid. A buyer can purchase several items (also twice, three times, ..., n times), identified by itemid and description.
I would like to count, for each item, the share of shopping carts it ends up in. For example, out of 5 purchases, 3 people bought an apple, i.e. 0.6 (60%). I would like to compute this for all products, how do I do that?
import pandas as pd
d = {'buyerid': [0,0,1,2,3,3,3,4,4,4,4],
'itemid': [0,1,2,1,1,1,2,4,5,1,1],
'description': ['Banana', 'Apple', 'Apple', 'Strawberry', 'Apple', 'Banana', 'Apple', 'Dog-Food', 'Beef', 'Banana', 'Apple'], }
df = pd.DataFrame(data=d)
display(df.head(20))
My try:
# What percentage of the articles are the same?
# this is wrong... :/
df_grouped = df.groupby('description').count()
display(df_grouped)
df_apple = df_grouped.iloc[0]
percentage = df_apple[0] / df.shape[0]
print(percentage)
[OUT] 0.45454545454545453
The mathematical formula:
count of all purchases (count_buy) = 5
count of purchases containing an apple (count_apple) = 3
count_apple / count_buy = 3 / 5 = 0.6
What I would like to have (please note, I have not calculated the values, these are just dummy values)

Use GroupBy.size and divide by the number of unique buyerid values, obtained with Series.nunique:
print (df.groupby(['itemid','description']).size())
itemid description
0 Banana 1
1 Apple 3
Banana 2
Strawberry 1
2 Apple 2
4 Dog-Food 1
5 Beef 1
dtype: int64
purch = df['buyerid'].nunique()
df1 = df.groupby(['itemid','description']).size().div(purch).reset_index(name='percentage')
print (df1)
itemid description percentage
0 0 Banana 0.2
1 1 Apple 0.6
2 1 Banana 0.4
3 1 Strawberry 0.2
4 2 Apple 0.4
5 4 Dog-Food 0.2
6 5 Beef 0.2
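If the same buyer could add the same item to a cart more than once and you want the share of carts that contain the item at least once (rather than the raw occurrence count), a possible refinement - just a sketch, not part of the answer above - is to drop duplicate buyer/item pairs first:
purch = df['buyerid'].nunique()
df_share = (df.drop_duplicates(['buyerid', 'itemid', 'description'])
              .groupby(['itemid', 'description'])
              .size()
              .div(purch)
              .reset_index(name='percentage'))
print(df_share)
In this sample data there are no duplicate buyer/item pairs, so the result is the same as df1 above.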

I would group it and create a new column as follows:
df_grp = df.groupby('description')['buyerid'].count().reset_index(name='total')
df_grp['percentage'] = (df_grp.total / df_grp.total.sum()) * 100
df_grp
Result:
  description  total  percentage
0       Apple      5   45.454545
1      Banana      3   27.272727
2        Beef      1    9.090909
3    Dog-Food      1    9.090909
4  Strawberry      1    9.090909

As always, there are multiple ways to get there, but I would go with pivoting, as follows:
Your input:
import pandas as pd
d = {'buyerid': [0,0,1,2,3,3,3,4,4,4,4],
'itemid': [0,1,2,1,1,1,2,4,5,1,1],
'description': ['Banana', 'Apple', 'Apple', 'Strawberry', 'Apple', 'Banana', 'Apple', 'Dog-Food', 'Beef', 'Banana', 'Apple'], }
df = pd.DataFrame(data=d)
In the next step, pivot the data with buyerid as the index and description as the columns, then replace NaN with 0:
df2 = df.pivot_table(values='itemid', index='buyerid', columns='description', aggfunc='count')
df2 = df2.fillna(0)
resulting in
description Apple Banana Beef Dog-Food Strawberry
buyerid
0 1.0 1.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 1.0
3 2.0 1.0 0.0 0.0 0.0
4 1.0 1.0 1.0 1.0 0.0
calling the mean on the table:
df_final = df2.mean()
results in
description
Apple 1.0
Banana 0.6
Beef 0.2
Dog-Food 0.2
Strawberry 0.2
dtype: float64
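Note that the mean of the raw counts gives the average number of units of each item per cart (Apple is 1.0 because buyer 3 has two Apple rows). If what you want is the share of carts that contain an item at least once, a possible variation - just a sketch on top of df2, not part of the original answer - turns the counts into 0/1 flags before averaging:
df_share = df2.gt(0).mean()
print(df_share)
description
Apple         0.8
Banana        0.6
Beef          0.2
Dog-Food      0.2
Strawberry    0.2
dtype: float64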

Related

Reshaping Data with Python [duplicate]

How can I melt a pandas data frame using multiple variable names and values? I have the following data frame that changes its shape in a for loop. In one of the for loop iterations, it looks like this:
ID Cat Class_A Class_B Prob_A Prob_B
1 Veg 1 2 0.9 0.1
2 Veg 1 2 0.8 0.2
3 Meat 1 2 0.6 0.4
4 Meat 1 2 0.3 0.7
5 Veg 1 2 0.2 0.8
I need to melt it in such a way that it looks like this:
ID Cat Class Prob
1 Veg 1 0.9
1 Veg 2 0.1
2 Veg 1 0.8
2 Veg 2 0.2
3 Meat 1 0.6
3 Meat 2 0.4
4 Meat 1 0.3
4 Meat 2 0.7
5 Veg 1 0.2
5 Veg 2 0.8
During the for loop, the data frame will contain a different number of classes with their probabilities. That is why I am looking for a general approach that is applicable in all my for loop iterations. I saw this question and this one, but they were not helpful!
You need lreshape with a dict to specify the categories:
d = {'Class':['Class_A', 'Class_B'], 'Prob':['Prob_A','Prob_B']}
df = pd.lreshape(df,d)
print (df)
Cat ID Class Prob
0 Veg 1 1 0.9
1 Veg 2 1 0.8
2 Meat 3 1 0.6
3 Meat 4 1 0.3
4 Veg 5 1 0.2
5 Veg 1 2 0.1
6 Veg 2 2 0.2
7 Meat 3 2 0.4
8 Meat 4 2 0.7
9 Veg 5 2 0.8
More dynamic solution:
Class = [col for col in df.columns if col.startswith('Class')]
Prob = [col for col in df.columns if col.startswith('Prob')]
df = pd.lreshape(df, {'Class':Class, 'Prob':Prob})
print (df)
Cat ID Class Prob
0 Veg 1 1 0.9
1 Veg 2 1 0.8
2 Meat 3 1 0.6
3 Meat 4 1 0.3
4 Veg 5 1 0.2
5 Veg 1 2 0.1
6 Veg 2 2 0.2
7 Meat 3 2 0.4
8 Meat 4 2 0.7
9 Veg 5 2 0.8
EDIT:
lreshape is now undocumented, and it is possible it will be removed in the future (together with pd.wide_to_long).
A possible solution is merging all 3 functions into one - maybe melt - but it is not implemented yet. Maybe in some new version of pandas; then my answer will be updated.
Or you can try this by using str.contains and pd.concat:
DF1 = df.loc[:, df.columns.str.contains('_A|Cat|ID')]
name = ['ID', 'Cat', 'Class', 'Prob']
DF1.columns = name
DF2 = df.loc[:, df.columns.str.contains('_B|Cat|ID')]
DF2.columns = name
pd.concat([DF1, DF2], axis=0)
Out[354]:
ID Cat Class Prob
0 1 Veg 1 0.9
1 2 Veg 1 0.8
2 3 Meat 1 0.6
3 4 Meat 1 0.3
4 5 Veg 1 0.2
0 1 Veg 2 0.1
1 2 Veg 2 0.2
2 3 Meat 2 0.4
3 4 Meat 2 0.7
4 5 Veg 2 0.8
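If the duplicated row index produced by the concatenation bothers you, a possible cleanup step - a sketch, not part of the answer above - is:
out = pd.concat([DF1, DF2], axis=0).sort_values(['ID', 'Class']).reset_index(drop=True)
print(out)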
The top-voted answer uses the undocumented lreshape, which may at some point get deprecated because of its similarity to pd.wide_to_long, which is documented and can be used directly here. By default, suffix matches only numbers; you must change this to match characters (here I just used any character).
pd.wide_to_long(df, stubnames=['Class', 'Prob'], i=['ID', 'Cat'], j='DROPME', suffix='.')\
.reset_index()\
.drop('DROPME', axis=1)
ID Cat Class Prob
0 1 Veg 1 0.9
1 1 Veg 2 0.1
2 2 Veg 1 0.8
3 2 Veg 2 0.2
4 3 Meat 1 0.6
5 3 Meat 2 0.4
6 4 Meat 1 0.3
7 4 Meat 2 0.7
8 5 Veg 1 0.2
9 5 Veg 2 0.8
You could also use pd.melt.
# Make DataFrame
df = pd.DataFrame({'ID': [i for i in range(1, 6)],
                   'Cat': ['Veg']*2 + ['Meat']*2 + ['Veg'],
                   'Class_A': [1]*5,
                   'Class_B': [2]*5,
                   'Prob_A': [0.9, 0.8, 0.6, 0.3, 0.2],
                   'Prob_B': [0.1, 0.2, 0.4, 0.7, 0.8]})
# Make class dataframe and prob dataframe
df_class = df.loc[:, ['ID', 'Cat', 'Class_A', 'Class_B']]
df_prob = df.loc[:, ['ID', 'Cat', 'Prob_A', 'Prob_B']]
# Melt class dataframe and prob dataframe
df_class = df_class.melt(id_vars=['ID', 'Cat'],
                         value_vars=['Class_A', 'Class_B'],
                         value_name='Class')
df_prob = df_prob.melt(id_vars=['ID', 'Cat'],
                       value_vars=['Prob_A', 'Prob_B'],
                       value_name='Prob')
# Clean variable column so only 'A','B' is left in both dataframes
df_class.loc[:, 'variable'] = df_class.loc[:, 'variable'].str.partition('_')[2]
df_prob.loc[:, 'variable'] = df_prob.loc[:, 'variable'].str.partition('_')[2]
# Merge class dataframe with prob dataframe on 'ID', 'Cat', and 'variable';
# drop 'variable'; sort values by 'ID', 'Cat'
final = (df_class.merge(df_prob, how='inner', on=['ID', 'Cat', 'variable'])
                 .drop('variable', axis=1)
                 .sort_values(by=['ID', 'Cat']))
One option is pivot_longer from pyjanitor, which abstracts the process and is efficient:
# pip install janitor
import janitor
df.pivot_longer(
    index=['ID', 'Cat'],
    names_to='.value',
    names_pattern='([a-zA-Z]+)_*')
ID Cat Class Prob
0 1 Veg 1 0.9
1 2 Veg 1 0.8
2 3 Meat 1 0.6
3 4 Meat 1 0.3
4 5 Veg 1 0.2
5 1 Veg 2 0.1
6 2 Veg 2 0.2
7 3 Meat 2 0.4
8 4 Meat 2 0.7
9 5 Veg 2 0.8
The idea for this particular reshape is that whatever group in the regular expression is paired with the .value stays as the column header.

"Correlation matrix" for strings. Similarity of nominal data

Here is my data frame.
df
store_1 store_2 store_3 store_4
0 banana banana plum banana
1 orange tangerine pear orange
2 apple pear melon apple
3 pear raspberry pineapple plum
4 plum tomato peach tomato
I'm looking for a way to count the number of co-occurrences across stores (to compare their similarity).
You can try something like this
import itertools as it
import numpy as np
corr = lambda a,b: len(set(a).intersection(set(b)))/len(a)
c = [corr(*x) for x in it.combinations_with_replacement(df.T.values.tolist(),2)]
j = 0
x = []
for i in range(4, 0, -1): # replace 4 with df.shape[-1]
x.append([np.nan]*(4-i) + c[j:j+i])
j+= i
pd.DataFrame(x, columns=df.columns, index=df.columns)
Which yields
store_1 store_2 store_3 store_4
store_1 1.0 0.4 0.4 0.8
store_2 NaN 1.0 0.2 0.4
store_3 NaN NaN 1.0 0.2
store_4 NaN NaN NaN 1.0
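If a full symmetric matrix (without the NaN lower triangle) is preferred, one possible variation - a sketch using the same overlap measure, not part of the answer above - computes it for every pair of columns directly:
corr = lambda a, b: len(set(a).intersection(set(b))) / len(a)
sim = pd.DataFrame([[corr(df[a], df[b]) for b in df.columns] for a in df.columns],
                   index=df.columns, columns=df.columns)
print(sim)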
If you wish to estimate the similarity of the stores with regard to their products, then you could use one-hot encoding.
Each store can then be described by a vector of length n = the number of distinct products across all stores, e.g.:
banana, orange, apple, pear, plum, tangerine, raspberry, tomato, melon, ...
Store_1 is then described as 1 1 1 1 1 0 0 0 0 ...
Store_2 as 1 0 0 1 0 1 1 1 0 ...
This leaves you with numerical vectors on which you can compute a dissimilarity measure such as the Euclidean distance.
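A minimal sketch of this idea in pandas (assuming the df from the question; pd.crosstab and scipy's pdist are used here only for illustration and are not part of the original answer):
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# long format: one row per (store, product) pair
long_df = df.melt(var_name='store', value_name='product')
# one-hot matrix: rows = stores, columns = products, values = 0/1
onehot = pd.crosstab(long_df['store'], long_df['product']).clip(upper=1)
# pairwise Euclidean distances between the store vectors
dist = pd.DataFrame(squareform(pdist(onehot.values.astype(float), metric='euclidean')),
                    index=onehot.index, columns=onehot.index)
print(dist)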

update pandas groupby group with column value

I have a test df like this:
df = pd.DataFrame({'A': ['Apple','Apple', 'Apple','Orange','Orange','Orange','Pears','Pears'],
'B': [1,2,9,6,4,3,2,1]
})
A B
0 Apple 1
1 Apple 2
2 Apple 9
3 Orange 6
4 Orange 4
5 Orange 3
6 Pears 2
7 Pears 1
Now I need to add a new column with the respective % differences in col 'B'. How is this possible? I cannot get this to work.
I have looked at
update column value of pandas groupby().last()
Not sure that it is pertinent to my problem.
And this which looks promising
Pandas Groupby and Sum Only One Column
I need to compute the maximum percentage change in col 'B' per group of col 'A' and insert it into the col maxpercchng for all rows in the group.
So I have come up with this code:
grouppercchng = ((df.groupby['A'].max() - df.groupby['A'].min())/df.groupby['A'].iloc[0])*100
and try to add it to the group col 'maxpercchng' like so
group['maxpercchng'] = grouppercchng
Or like so
df_kpi_hot.groupby(['A'], as_index=False)['maxpercchng'] = grouppercchng
Does anyone know how to add the maxpercchng col to all rows in a group?
I believe you need transform, which returns a Series of the same size as the original DataFrame filled with the aggregated values:
g = df.groupby('A')['B']
df['maxpercchng'] = (g.transform('max') - g.transform('min')) / g.transform('first') * 100
print (df)
A B maxpercchng
0 Apple 1 800.0
1 Apple 2 800.0
2 Apple 9 800.0
3 Orange 6 50.0
4 Orange 4 50.0
5 Orange 3 50.0
6 Pears 2 50.0
7 Pears 1 50.0
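The three transform calls can also be collapsed into a single transform with a lambda; this is just a sketch of an equivalent formulation, not part of the original answer:
g = df.groupby('A')['B']
df['maxpercchng'] = g.transform(lambda s: (s.max() - s.min()) / s.iloc[0] * 100)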
Or:
g = df.groupby('A')['B']
df1 = ((g.max() - g.min()) / g.first() * 100).reset_index()
print (df1)
A B
0 Apple 800.0
1 Orange 50.0
2 Pears 50.0
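If you prefer the aggregated variant but still need the values on every row, one option - a sketch, not part of the original answer - is to map them back via the 'A' column:
df['maxpercchng'] = df['A'].map(df1.set_index('A')['B'])
print(df)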

Similar pandas statement of SQL where clause

Two tables:
Price list table PRICE_LIST:
ITEM PRICE
MANGO 5
BANANA 2
APPLE 2.5
ORANGE 1.5
Records of sale REC_SALE (list of transactions)
ITEM SELLLING_PRICE
MANGO 4
MANGO 3
BANANA 2
BANANA 1
ORANGE 0.5
ORANGE 4
Selecting records from REC_SALE where items were sold for less than the PRICE listed in the PRICE_LIST table:
SELECT A.*
FROM
(
select RS.ITEM,RS.SELLING_PRICE, PL.PRICE AS ACTUAL_PRICE
from REC_SALE RS,
PRICE_LIST PL
where RS.ITEM = PL.ITEM
) A
WHERE A.SELLING_PRICE < A.ACTUAL_PRICE ;
Result:
ITEM SELLING_PRICE PRICE
MANGO 4 5
MANGO 3 5
BANANA 1 2
ORANGE 0.5 1.5
I have these same two tables as dataframes in a Jupyter notebook.
What would be an equivalent Python statement of the SQL statement above, using pandas?
merge with .loc
df1.merge(df2).loc[lambda x : x.PRICE>x.SELLLING_PRICE]
Out[198]:
ITEM PRICE SELLLING_PRICE
0 MANGO 5.0 4.0
1 MANGO 5.0 3.0
3 BANANA 2.0 1.0
4 ORANGE 1.5 0.5
Use merge with query:
df = pd.merge(df1, df2, on='ITEM').query('PRICE >SELLLING_PRICE')
print (df)
ITEM PRICE SELLLING_PRICE
0 MANGO 5.0 4.0
1 MANGO 5.0 3.0
3 BANANA 2.0 1.0
4 ORANGE 1.5 0.5
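A third equivalent formulation - a sketch along the same lines, using the same df1/df2 as above, not from the original answers - merges first and then filters with a plain boolean mask:
m = pd.merge(df1, df2, on='ITEM')
out = m[m['SELLLING_PRICE'] < m['PRICE']]
print(out)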

Multi-column replacement in Pandas based on row selection

There is a plethora of questions on SO about how to select rows in a DataFrame and replace values in a column in those rows, but one use case is missing. To use the example DataFrame from this question:
In [1]: df
Out[1]:
apple banana cherry
0 0 3 good
1 1 4 bad
2 2 5 good
And this works if one wants to change a single column based on another:
df.loc[df.cherry == 'bad', 'apple'] = df.banana * 2
Or this sets the values in two columns:
df.loc[df.cherry == 'bad', ['apple', 'banana']] = np.nan
But this doesn't work:
df.loc[df.cherry == 'bad', ['apple', 'banana']] = [df.banana, df.apple]
, because apparently the right side is 3x2, while the left side is 1x2, hence the error message
ValueError: Must have equal len keys and value when setting with an ndarray
So I understand what the problem is, but what is the solution?
IIUC you can try:
df['a'] = df.apple * 3
df['b'] = df.banana * 2
print df
apple banana cherry a b
0 0 3 good 0 6
1 1 4 bad 3 8
2 2 5 good 6 10
df[['a', 'b']] = df.loc[df.cherry == 'bad', ['apple', 'banana']]
print df
apple banana cherry a b
0 0 3 good NaN NaN
1 1 4 bad 1.0 4.0
2 2 5 good NaN NaN
Or use conditions with values:
df['a'] = df.apple * 3
df['b'] = df.banana * 2
df.loc[df.cherry == 'bad', ['apple', 'banana']] = \
    df.loc[df.cherry == 'bad', ['a', 'b']].values
print df
apple banana cherry a b
0 0 3 good 0 6
1 3 8 bad 3 8
2 2 5 good 6 10
Another option with the original columns:
print df[['apple','banana']].shift() * 2
apple banana
0 NaN NaN
1 12.0 6.0
2 2.0 8.0
df.loc[df.cherry == 'bad', ['apple', 'banana']] = df[['apple','banana']].shift() * 2
print df
apple banana cherry
0 6.0 3.0 good
1 12.0 6.0 bad
2 2.0 5.0 good
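For the literal swap the question asks about (apple and banana exchanged on the selected rows), one way to sidestep the length error - a sketch, not part of the answer above, assuming the original df - is to select the two columns in reversed order and assign their .values so that column alignment is bypassed:
mask = df.cherry == 'bad'
df.loc[mask, ['apple', 'banana']] = df.loc[mask, ['banana', 'apple']].values
# row 1 ('bad') now has apple = 4 and banana = 1; the 'good' rows are unchanged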
