Equivalent pandas statement for a SQL WHERE clause - python

Two tables:
Price list table PRICE_LIST:
ITEM PRICE
MANGO 5
BANANA 2
APPLE 2.5
ORANGE 1.5
Records of sale REC_SALE (list of transactions)
ITEM SELLING_PRICE
MANGO 4
MANGO 3
BANANA 2
BANANA 1
ORANGE 0.5
ORANGE 4
Selecting records from REC_SALE where items were sold for less than the price listed in the PRICE_LIST table:
SELECT A.*
FROM
(
select RS.ITEM,RS.SELLING_PRICE, PL.PRICE AS ACTUAL_PRICE
from REC_SALE RS,
PRICE_LIST PL
where RS.ITEM = PL.ITEM
) A
WHERE A.SELLING_PRICE < A.ACTUAL_PRICE ;
Result:
ITEM SELLING_PRICE ACTUAL_PRICE
MANGO 4 5
MANGO 3 5
BANANA 1 2
ORANGE 0.5 1.5
I have these same two tables as DataFrames in a Jupyter notebook.
What would be an equivalent Python statement for the SQL statement above using pandas?

Use merge with .loc:
df1.merge(df2).loc[lambda x: x.PRICE > x.SELLING_PRICE]
Out[198]:
ITEM PRICE SELLING_PRICE
0 MANGO 5.0 4.0
1 MANGO 5.0 3.0
3 BANANA 2.0 1.0
4 ORANGE 1.5 0.5

Use merge with query:
df = pd.merge(df1, df2, on='ITEM').query('PRICE > SELLING_PRICE')
print (df)
ITEM PRICE SELLING_PRICE
0 MANGO 5.0 4.0
1 MANGO 5.0 3.0
3 BANANA 2.0 1.0
4 ORANGE 1.5 0.5
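
For reference, a self-contained sketch that builds both DataFrames and runs the two approaches above (df1 is assumed to be the price list, df2 the sales records):
import pandas as pd

df1 = pd.DataFrame({'ITEM': ['MANGO', 'BANANA', 'APPLE', 'ORANGE'],
                    'PRICE': [5, 2, 2.5, 1.5]})
df2 = pd.DataFrame({'ITEM': ['MANGO', 'MANGO', 'BANANA', 'BANANA', 'ORANGE', 'ORANGE'],
                    'SELLING_PRICE': [4, 3, 2, 1, 0.5, 4]})

# merge on the common ITEM column, then keep the rows sold below the list price
print(df1.merge(df2).loc[lambda x: x.PRICE > x.SELLING_PRICE])
print(pd.merge(df1, df2, on='ITEM').query('PRICE > SELLING_PRICE'))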

Related

Python Pandas count function on condition and subset

I have a dataframe like this:
F_Class Product Packages
Apple Apple_A 1
Apple Apple_A 2
Apple Apple_A 1
Apple Apple_B 2
Bananas Banana_A n.a.
Bananas Banana_A n.a.
I want to build a count function that counts the items in my dataframe as shown below.
The function should count within the subset ['F_Class','Product']:
if df['Packages'] == 2, increase the counter by 2, otherwise by 1.
The result should look like this:
F_Class Product Packages Counter
Apple Apple_A 1 1
Apple Apple_A 2 3
Apple Apple_A 1 4
Apple Apple_B 2 2
Bananas Banana_A n.a. 1
Bananas Banana_A n.a. 2
If you need a running sum of the Packages numbers, use DataFrameGroupBy.cumsum after replacing missing values with 1:
# convert 'n.a.' strings to NaN
df['Packages'] = pd.to_numeric(df['Packages'], errors='coerce')
# treat missing packages as 1 and take the cumulative sum per group
df['Counter'] = (df.assign(Packages=df['Packages'].fillna(1).astype(int))
                   .groupby(['F_Class','Product'])['Packages'].cumsum())
print (df)
F_Class Product Packages Counter
0 Apple Apple_A 1.0 1
1 Apple Apple_A 2.0 3
2 Apple Apple_A 1.0 4
3 Apple Apple_B 2.0 2
4 Bananas Banana_A NaN 1
5 Bananas Banana_A NaN 2
Detail:
print (df.assign(Packages = df['Packages'].fillna(1).astype(int)))
F_Class Product Packages
0 Apple Apple_A 1
1 Apple Apple_A 2
2 Apple Apple_A 1
3 Apple Apple_B 2
4 Bananas Banana_A 1
5 Bananas Banana_A 1
Use df.groupby() together with df.transform() as follows:
df['Counter'] = (df.groupby(['F_Class','Product'])['Packages']
                   .transform(lambda x: x.eq('2').add(1).cumsum()))
print(df)
F_Class Product Packages Counter
0 Apple Apple_A 1 1
1 Apple Apple_A 2 3
2 Apple Apple_A 1 4
3 Apple Apple_B 2 2
4 Bananas Banana_A n.a. 1
5 Bananas Banana_A n.a. 2
If the values in column Packages are integers rather than strings, change '2' to 2:
df['Counter'] = (df.groupby(['F_Class','Product'])['Packages']
                   .transform(lambda x: x.eq(2).add(1).cumsum()))
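
For completeness, a minimal reproducible sketch of the string-based variant (assuming Packages is stored as strings, with 'n.a.' marking missing values as in the question):
import pandas as pd

df = pd.DataFrame({'F_Class': ['Apple', 'Apple', 'Apple', 'Apple', 'Bananas', 'Bananas'],
                   'Product': ['Apple_A', 'Apple_A', 'Apple_A', 'Apple_B', 'Banana_A', 'Banana_A'],
                   'Packages': ['1', '2', '1', '2', 'n.a.', 'n.a.']})

# add 2 when Packages equals '2', otherwise 1, cumulated per (F_Class, Product) group
df['Counter'] = (df.groupby(['F_Class','Product'])['Packages']
                   .transform(lambda x: x.eq('2').add(1).cumsum()))
print(df)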

How often is an item in a purchase?

I would like to calculate how often an item appears in a shopping cart.
Each purchase is identified by a buyerid. A buyer can purchase several items (also twice, three times, ..., n times), identified by itemid and description.
I would like to count how often an item ends up in a shopping cart. For example, out of 5 purchases, 3 people bought an apple, i.e. 0.6 (60%). I would like to output this for all products; how do I do that?
import pandas as pd
d = {'buyerid': [0,0,1,2,3,3,3,4,4,4,4],
     'itemid': [0,1,2,1,1,1,2,4,5,1,1],
     'description': ['Banana', 'Apple', 'Apple', 'Strawberry', 'Apple', 'Banana', 'Apple', 'Dog-Food', 'Beef', 'Banana', 'Apple']}
df = pd.DataFrame(data=d)
display(df.head(20))
My try:
# How many % of the articles are the same?
# this is wrong... :/
df_grouped = df.groupby('description').count()
display(df_grouped)
df_apple = df_grouped.iloc[0]
percentage = df_apple[0] / df.shape[0]
print(percentage)
[OUT] 0.45454545454545453
The mathematical formula:
count of all purchases (count_buy) = 5
count of purchases in which an apple appears (count_apple) = 3
count_apple / count_buy = 3 / 5 = 0.6
What I would like to have (please note, I have not calculated the values; these are just dummy values):
Use GroupBy.size and divide by the number of unique buyerid values obtained with Series.nunique:
print (df.groupby(['itemid','description']).size())
itemid description
0 Banana 1
1 Apple 3
Banana 2
Strawberry 1
2 Apple 2
4 Dog-Food 1
5 Beef 1
dtype: int64
purch = df['buyerid'].nunique()
df1 = df.groupby(['itemid','description']).size().div(purch).reset_index(name='percentage')
print (df1)
itemid description percentage
0 0 Banana 0.2
1 1 Apple 0.6
2 1 Banana 0.4
3 1 Strawberry 0.2
4 2 Apple 0.4
5 4 Dog-Food 0.2
6 5 Beef 0.2
I would group by description, count the rows, and create a percentage column as follows (this gives the share of all sold items, not the share of buyers):
df_grp = df.groupby('description')['buyerid'].count().reset_index(name='total')
df_grp['percentage'] = (df_grp.total / df_grp.total.sum()) * 100
df_grp
Result:
description total percentage
0 Apple 5 45.454545
1 Banana 3 27.272727
2 Beef 1 9.090909
3 Dog-Food 1 9.090909
4 Strawberry 1 9.090909
As always, there are multiple ways to get there, but I would go with pivoting, as follows:
Your input:
import pandas as pd
d = {'buyerid': [0,0,1,2,3,3,3,4,4,4,4],
     'itemid': [0,1,2,1,1,1,2,4,5,1,1],
     'description': ['Banana', 'Apple', 'Apple', 'Strawberry', 'Apple', 'Banana', 'Apple', 'Dog-Food', 'Beef', 'Banana', 'Apple']}
df = pd.DataFrame(data=d)
In the next step, pivot the data with buyerid as index and description as columns, and replace NaN with 0 as follows:
df2 = df.pivot_table(values='itemid', index='buyerid', columns='description', aggfunc='count')
df2 = df2.fillna(0)
resulting in
description Apple Banana Beef Dog-Food Strawberry
buyerid
0 1.0 1.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 1.0
3 2.0 1.0 0.0 0.0 0.0
4 1.0 1.0 1.0 1.0 0.0
calling the mean on the table:
df_final = df2.mean()
results in
description
Apple 1.0
Banana 0.6
Beef 0.2
Dog-Food 0.2
Strawberry 0.2
dtype: float64
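
Note that df2.mean() gives the average number of each item per purchase (Apple is 1.0 because one buyer bought it twice). If instead you want the share of purchases that contain the item at least once, a small variation on the same pivot table is (a sketch):
# fraction of buyers whose cart contains the item at least once
print(df2.gt(0).mean())
# for the sample data: Apple 0.8, Banana 0.6, Beef 0.2, Dog-Food 0.2, Strawberry 0.2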

Add values to the sum of a column based on another column in csv using Pandas Python

Let's say I have this dataframe:
Fruits Price Quantity
apple 12 10
pear 50 5
kiwi 42 20
kiwi 30 35
I want to compute the sum like this, grouping by Fruits:
df.groupby(['Fruits'])['Price'].sum()
All good until now, but I want the price to be added to the sum halved (price/2) for the rows where the quantity is above 10. How do I do this?
You can try making changes to the dataframe first and then calculate the sum after grouping.
df_clone = df.copy()
# halve the price where more than 10 units were sold
df_clone['Price'] = [df_clone['Price'].loc[i] / 2 if df_clone['Quantity'].loc[i] > 10
                     else df_clone['Price'].loc[i] for i in range(df_clone.shape[0])]
print(df_clone)
which will give:
Fruits Price Quantity
0 apple 12.0 10
1 pear 50.0 5
2 kiwi 21.0 20
3 kiwi 15.0 35
and now you can group this new dataframe to get your output:
df_clone.groupby(['Fruits'])['Price'].sum()
which results in:
Fruits
apple 12.0
kiwi 36.0
pear 50.0
Name: Price, dtype: float64
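
A vectorized alternative (a sketch, assuming df holds the Fruits/Price/Quantity table from the question) uses numpy.where instead of the list comprehension:
import numpy as np

# halve the price where more than 10 units were sold, then sum per fruit
result = (df.assign(Price=np.where(df['Quantity'] > 10, df['Price'] / 2, df['Price']))
            .groupby('Fruits')['Price'].sum())
print(result)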

"Correlation matrix" for strings. Similarity of nominal data

Here is my data frame.
df
store_1 store_2 store_3 store_4
0 banana banana plum banana
1 orange tangerine pear orange
2 apple pear melon apple
3 pear raspberry pineapple plum
4 plum tomato peach tomato
I'm looking for a way to count the number of co-occurrences across stores (to compare their similarity).
You can try something like this
import itertools as it
import numpy as np
import pandas as pd

# fraction of the first store's products that also appear in the second store
corr = lambda a, b: len(set(a).intersection(set(b))) / len(a)
c = [corr(*x) for x in it.combinations_with_replacement(df.T.values.tolist(), 2)]
j = 0
x = []
for i in range(4, 0, -1):  # replace 4 with df.shape[-1]
    x.append([np.nan] * (4 - i) + c[j:j + i])
    j += i
pd.DataFrame(x, columns=df.columns, index=df.columns)
Which yields
store_1 store_2 store_3 store_4
store_1 1.0 0.4 0.4 0.8
store_2 NaN 1.0 0.2 0.4
store_3 NaN NaN 1.0 0.2
store_4 NaN NaN NaN 1.0
If you wish to estimate the similarity of the stores with regard to their products, then you could use:
One-hot encoding
Each store can then be described by a vector of length n = number of all distinct products across all stores, such as:
banana
orange
apple
pear
plum
tangerine
raspberry
tomato
melon
.
.
.
Store_1 then is described as 1 1 1 1 1 0 0 0 0 0 ...
Store_2 1 0 0 1 0 1 1 1 0 ...
This leaves you with numerical vectors, on which you can compute a dissimilarity measure such as Euclidean distance.
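
A possible sketch of that idea with pandas and scipy (the variable names are illustrative; it one-hot encodes each store's assortment and then computes pairwise Euclidean distances):
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# one row per (store, product) pair, then a 0/1 indicator matrix per store
assortment = (df.melt(var_name='store', value_name='product')
                .assign(present=1)
                .pivot_table(index='store', columns='product',
                             values='present', fill_value=0))

# pairwise Euclidean distances between the stores' indicator vectors
dist = pd.DataFrame(squareform(pdist(assortment, metric='euclidean')),
                    index=assortment.index, columns=assortment.index)
print(dist)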

update pandas groupby group with column value

I have a test df like this:
df = pd.DataFrame({'A': ['Apple','Apple','Apple','Orange','Orange','Orange','Pears','Pears'],
                   'B': [1,2,9,6,4,3,2,1]})
A B
0 Apple 1
1 Apple 2
2 Apple 9
3 Orange 6
4 Orange 4
5 Orange 3
6 Pears 2
7 Pears 1
Now I need to add a new column with the respective % differences in col 'B'. How is this possible? I cannot get this to work.
I have looked at
update column value of pandas groupby().last()
Not sure that it is pertinent to my problem.
And this which looks promising
Pandas Groupby and Sum Only One Column
I need to compute the maximum percentage change in col 'B' per group of col 'A' and insert it into a col maxpercchng for all rows in the group.
So I have come up with this code:
grouppercchng = ((df.groupby['A'].max() - df.groupby['A'].min())/df.groupby['A'].iloc[0])*100
and try to add it to the group col 'maxpercchng' like so
group['maxpercchng'] = grouppercchng
Or like so
df_kpi_hot.groupby(['A'], as_index=False)['maxpercchng'] = grouppercchng
Does anyone know how to add to all rows in group the maxpercchng col?
I believe you need transform, which returns a Series of the same size as the original DataFrame, filled with the aggregated values:
g = df.groupby('A')['B']
df['maxpercchng'] = (g.transform('max') - g.transform('min')) / g.transform('first') * 100
print (df)
A B maxpercchng
0 Apple 1 800.0
1 Apple 2 800.0
2 Apple 9 800.0
3 Orange 6 50.0
4 Orange 4 50.0
5 Orange 3 50.0
6 Pears 2 50.0
7 Pears 1 50.0
Or:
g = df.groupby('A')['B']
df1 = ((g.max() - g.min()) / g.first() * 100).reset_index()
print (df1)
A B
0 Apple 800.0
1 Orange 50.0
2 Pears 50.0
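
If you prefer to compute the per-group value once and then attach it to every row, merging the aggregated frame back also works (a sketch equivalent to the transform version above):
g = df.groupby('A')['B']
stats = ((g.max() - g.min()) / g.first() * 100).rename('maxpercchng').reset_index()

# broadcast the per-group value back to every row with a left merge
df = df.merge(stats, on='A', how='left')
print(df)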
