"Correlation matrix" for strings. Similarity of nominal data

"Correlation matrix" for strings. Similarity of nominal data - python

Here is my data frame.
df
store_1 store_2 store_3 store_4
0 banana banana plum banana
1 orange tangerine pear orange
2 apple pear melon apple
3 pear raspberry pineapple plum
4 plum tomato peach tomato
I'm looking for the way to count number of co-occurrences in stores (to compare their similarity).

You can try something like this
import itertools as it
corr = lambda a,b: len(set(a).intersection(set(b)))/len(a)
c = [corr(*x) for x in it.combinations_with_replacement(df.T.values.tolist(),2)]
j = 0
x = []
for i in range(4, 0, -1): # replace 4 with df.shape[-1]
x.append([np.nan]*(4-i) + c[j:j+i])
j+= i
pd.DataFrame(x, columns=df.columns, index=df.columns)
Which yields
store_1 store_2 store_3 store_4
store_1 1.0 0.4 0.4 0.8
store_2 NaN 1.0 0.2 0.4
store_3 NaN NaN 1.0 0.2
store_4 NaN NaN NaN 1.0

If you wish to estimate the similarity of the stores with regards to their products, then you could use:
One hot encoding
Then each stores can be described by a vector with length of n = number of all products among all stores such as:
banana
orange
apple
pear
plum
tangerin
raspberry
tomato
melon
.
.
.
Store_1 then is described as 1 1 1 1 1 0 0 0 0 0 ...
Store_2 1 0 0 1 0 1 1 1 0 ...
This leaves you with a numerical vector, where you can compute dissimilarity measure such as Euclidean Distance.

Related

Search N consecutive rows with same value in one dataframe

I need to create a python code to search "N" as variable, consecutive rows in a column dataframe with the same value and different that NaN like this.
I can't figure out how to do it with a for loop because I don't know which row I'm looking at in each case. Any idea that how can do it?
Fruit
2 matches
5 matches
Apple
No
No
NaN
No
No
Pear
No
No
Pear
Yes
No
Pear
Yes
No
Pear
Yes
No
Pear
Yes
Yes
NaN
No
No
NaN
No
No
NaN
No
No
NaN
No
No
NaN
No
No
Banana
No
No
Banana
Yes
No
Update: testing solutions by #Corralien
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum()) # virtual groups
.transform('cumcount').add(1) # cumulative counter
.where(df['Fruit'].notna(), other=0)) # set NaN to 0
N = 2
df['Matches'] = df.where(counts >= N, other='No')
VSCode return me the 'Frame skipped from debugging during step-in.' message when execute the last line and generate an exception in the previous for loop.

Compute consecutive values and set NaN to 0. Once you have calculated the cumulative counter, you just have to check if the counter is greater than or equal to N:
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum()) # virtual groups
.transform('cumcount').add(1) # cumulative counter
.where(df['Fruit'].notna(), other=0)) # set NaN to 0
N = 2
df['2 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
N = 5
df['5 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
Output:
>>> df
Fruit 2 matches 5 matches
0 Apple No No
1 NaN No No
2 Pear No No
3 Pear Yes No
4 Pear Yes No
5 Pear Yes No
6 Pear Yes Yes
7 NaN No No
8 NaN No No
9 NaN No No
10 NaN No No
11 NaN No No
12 Banana No No
13 Banana Yes No
>>> counts
0 1
1 0
2 1
3 2
4 3
5 4
6 5
7 0
8 0
9 0
10 0
11 0
12 1
13 2
dtype: int64
Update
if I need to change "Yes" for the fruit name for example
N = 2
df['2 matches'] = df.where(counts >= N, other='No')
print(df)
# Output
Fruit 2 matches
0 Apple No
1 NaN No
2 Pear No
3 Pear Pear
4 Pear Pear
5 Pear Pear
6 Pear Pear
7 NaN No
8 NaN No
9 NaN No
10 NaN No
11 NaN No
12 Banana No
13 Banana Banana

Co-occurence matrix from two data frames. Python

I have two data frames, Food and Drink.
food = {'fruit':['Apple', np.nan, 'Apple'],
'food':['Cake', 'Bread', np.nan]}
# Create DataFrame
food = pd.DataFrame(food)
fruit food
0 Apple Cake
1 NaN Bread
2 Apple NaN
drink = {'smoothie':['S_Strawberry', 'S_Watermelon', np.nan],
'tea':['T_white', np.nan, 'T_green']}
# Create DataFrame
drink = pd.DataFrame(drink)
smoothie tea
0 S_Strawberry T_white
1 S_Watermelon NaN
2 NaN T_green
The rows represent specific customers.
I would like to make a co-occurrence matrix of food and drinks.
expected outcome: (columns and ids do not have to be in this order)
Apple Bread Cake
S_Strawberry 1.0 NaN 1.0
S_Watermelon NaN 1.0 NaN
T_white 1.0 NaN 1.0
T_green 1.0 NaN NaN
so far I can make a co-occurrence matrix for each of the df but I don't know how I would bind the two data frames.
thank you.

I think you want pd.get_dummies and matrix multiplication:
pd.get_dummies(drink). T # pd.get_dummies(food)
Output:
fruit_Apple food_Bread food_Cake
smoothie_S_Strawberry 1 0 1
smoothie_S_Watermelon 0 1 0
tea_T_green 1 0 0
tea_T_white 1 0 1
You can get rid of the prefixes with:
pd.get_dummies(drink, prefix='', prefix_sep=''). T # pd.get_dummies(food, prefix='', prefix_sep='')
Output:
Apple Bread Cake
S_Strawberry 1 0 1
S_Watermelon 0 1 0
T_green 1 0 0
T_white 1 0 1

How often is an item in a purchase?

I would like to calculate how often an item appears in a shopping cart.
I have a purchase recognizable by the buyerid. This buyerid can buy several items (also twice, triple,..., n-th times). Recognizable by itemid and description.
I would like to count the number of times an item ends up in a shopping cart. For example, out of 5 purchases, 3 people bought an apple, i.e. 0.6%. I would like to spend this on all products, how do I do that?
import pandas as pd
d = {'buyerid': [0,0,1,2,3,3,3,4,4,4,4],
'itemid': [0,1,2,1,1,1,2,4,5,1,1],
'description': ['Banana', 'Apple', 'Apple', 'Strawberry', 'Apple', 'Banana', 'Apple', 'Dog-Food', 'Beef', 'Banana', 'Apple'], }
df = pd.DataFrame(data=d)
display(df.head(20))
My try:
# How many % of the articels are the same?
# this is wrong... :/
df_grouped = df.groupby('description').count()
display(df_grouped)
df_apple = df_grouped.iloc[0]
percentage = df_apple[0] / df.shape[0]
print(percentage)
[OUT] 0.45454545454545453
The mathematic formula
count of all buys (count_buy ) = 5
count how many an apple appears in the buy (count_apple) = 3
count_buy /count_apple = 3 / 5 = 0.6
What I would like to have (please note, I have not calculated the values, these are just dumy values)

Use GroupBy.size and divide by count of unique values of buyerid by Series.nunique:
print (df.groupby(['itemid','description']).size())
itemid description
0 Banana 1
1 Apple 3
Banana 2
Strawberry 1
2 Apple 2
4 Dog-Food 1
5 Beef 1
dtype: int64
purch = df['buyerid'].nunique()
df1 = df.groupby(['itemid','description']).size().div(purch).reset_index(name='percentage')
print (df1)
itemid description percentage
0 0 Banana 0.2
1 1 Apple 0.6
2 1 Banana 0.4
3 1 Strawberry 0.2
4 2 Apple 0.4
5 4 Dog-Food 0.2
6 5 Beef 0.2

I would group it and create a new column as follows:
df_grp = df.groupby('description')['buyerid'].sum().reset_index(name='total')
df_grp['percentage'] = (df_grp.total / df_grp.total.sum()) * 100
df_grp
Result:
description total percentage
0 Apple 11 39.285714
1 Banana 7 25.000000
2 Beef 4 14.285714
3 Dog-Food 4 14.285714
4 Strawberry 2 7.142857

As always, there are multiple ways to the gold, but i would go over pivoting as following:
Your input:
import pandas as pd
d = {'buyerid': [0,0,1,2,3,3,3,4,4,4,4],
'itemid': [0,1,2,1,1,1,2,4,5,1,1],
'description': ['Banana', 'Apple', 'Apple', 'Strawberry', 'Apple', 'Banana', 'Apple', 'Dog-Food', 'Beef', 'Banana', 'Apple'], }
df = pd.DataFrame(data=d)
In a next step pivot the data with buyer_id as index and description as columns and replace NA with 0 as such
df2 = df.pivot_table(values='itemid', index='buyerid', columns='description', aggfunc='count')
df2 = df2.fillna(0)
resulting in
description Apple Banana Beef Dog-Food Strawberry
buyerid
0 1.0 1.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 1.0
3 2.0 1.0 0.0 0.0 0.0
4 1.0 1.0 1.0 1.0 0.0
calling the mean on the table:
df_final = df2.mean()
results in
description
Apple 1.0
Banana 0.6
Beef 0.2
Dog-Food 0.2
Strawberry 0.2
dtype: float64

How to normalize columns with one-hot encoding efficiently in pandas dataframes?

A column of an example dataframe is shown:
Fruit FruitA FruitB
Apple Banana Mango
Banana Apple Apple
Mango Apple Banana
Banana Mango Banana
Mango Banana Apple
Apple Mango Mango
I want to introduce new columns in the dataframe Fruit-Apple, Fruit-Mango, Fruit-Banana with one-hot encoding in the rows they are respectively present. So, the desired output is:
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
Apple Banana Mango 1 1 1
Banana Apple Apple 1 1 0
Mango Apple Banana 1 1 1
Banana Mango Banana 0 1 1
Mango Banana Apple 1 1 1
Apple Mango Mango 1 0 1
My code to do this is:
for i in range(len(data)):
if (data['Fruits'][i] == 'Apple' or data['FruitsA'][i] == 'Apple' or data['FruitsB'][i] == 'Apple'):
data['Fruits-Apple'][i]=1
data['Fruits-Banana'][i]=0
data['Fruits-Mango'][i]=0
elif (data['Fruits'][i] == 'Banana' or data['FruitsA'][i] == 'Banana' or data['FruitsB'][i] == 'Banana'):
data['Fruits-Apple'][i]=0
data['Fruits-Banana'][i]=1
data['Fruits-Mango'][i]=0
elif (data['Fruits'][i] == 'Mango' or data['FruitsA'][i] == 'Mango' or data['FruitsB'][i] == 'Mango'):
data['Fruits-Apple'][i]=0
data['Fruits-Banana'][i]=0
data['Fruits-Mango'][i]=1
But I notice that the time taken for running this code increases dramatically if there are a lot of types of 'fruits'. In my actual data, there are only 1074 rows, and the column I'm trying to "normalize" with one-hot encoding has 18 different values. So, there are 18 if conditions inside the for loop, and the code hasn't finished running for 15 mins now. That's absurd (It would be great to know why its taking so long - in another column that contained only 6 different types of values, the code took much less time to execute, about 3 mins).
So, what's the best (vectorized) way to achieve this output?

Use join with get_dummies and add_prefix:
df = df.join(pd.get_dummies(df['Fruit']).add_prefix('Fruit-'))
print (df)
Fruit Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple 1 0 0
1 Banana 0 1 0
2 Mango 0 0 1
3 Banana 0 1 0
4 Mango 0 0 1
5 Apple 1 0 0
EDIT: If input are multiple columns use get_dummies with max by columns:
df = (df.join(pd.get_dummies(df, prefix='', prefix_sep='')
.max(level=0, axis=1)
.add_prefix('Fruit-')))
print (df)
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple Banana Mango 1 1 1
1 Banana Apple Apple 1 1 0
2 Mango Apple Banana 1 1 1
3 Banana Mango Banana 0 1 1
4 Mango Banana Apple 1 1 1
5 Apple Mango Mango 1 0 1
For better performance use MultiLabelBinarizer with DataFrame converted to lists:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.values.tolist()),
columns=mlb.classes_,
index=df.index).add_prefix('Fruit-'))
print (df)
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple Banana Mango 1 1 1
1 Banana Apple Apple 1 1 0
2 Mango Apple Banana 1 1 1
3 Banana Mango Banana 0 1 1
4 Mango Banana Apple 1 1 1
5 Apple Mango Mango 1 0 1

Similar pandas statement of SQL where clause

Two tables:
Price list table PRICE_LIST:
ITEM PRICE
MANGO 5
BANANA 2
APPLE 2.5
ORANGE 1.5
Records of sale REC_SALE (list of transactions)
ITEM SELLLING_PRICE
MANGO 4
MANGO 3
BANANA 2
BANANA 1
ORANGE 0.5
ORANGE 4
Selecting records from REC_SALE where Items were sold less than the PRICE listed in the PRICE_LIST table
SELECT A.*
FROM
(
select RS.ITEM,RS.SELLING_PRICE, PL.PRICE AS ACTUAL_PRICE
from REC_SALE RS,
PRICE_LIST PL
where RS.ITEM = PL.ITEM
) A
WHERE A.SELLING_PRICE < A.ACTUAL_PRICE ;
Result:
ITEM SELLING_PRICE PRICE
MANGO 4 5
MANGO 3 5
BANANA 1 2
ORANGE 0.5 1.5
I have these same two tables as dataframe in jupyter notebook
what would be a equivalent python statement of the SQL statement above using pandas?

merge with .loc
df1.merge(df2).loc[lambda x : x.PRICE>x.SELLLING_PRICE]
Out[198]:
ITEM PRICE SELLLING_PRICE
0 MANGO 5.0 4.0
1 MANGO 5.0 3.0
3 BANANA 2.0 1.0
4 ORANGE 1.5 0.5

Use merge with query:
df = pd.merge(df1, df2, on='ITEM').query('PRICE >SELLLING_PRICE')
print (df)
ITEM PRICE SELLLING_PRICE
0 MANGO 5.0 4.0
1 MANGO 5.0 3.0
3 BANANA 2.0 1.0
4 ORANGE 1.5 0.5

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

"Correlation matrix" for strings. Similarity of nominal data - python

Here is my data frame. df store_1 store_2 store_3 store_4 0 banana banana plum banana 1 orange tangerine pear orange 2 apple pear melon apple 3 pear raspberry pineapple plum 4 plum tomato peach tomato I'm looking for the way to count number of co-occurrences in stores (to compare their similarity).

Related

Search N consecutive rows with same value in one dataframe

Co-occurence matrix from two data frames. Python

How often is an item in a purchase?

How to normalize columns with one-hot encoding efficiently in pandas dataframes?

Similar pandas statement of SQL where clause

Categories

Resources