I have two data frames, Food and Drink.
import numpy as np
import pandas as pd

food = {'fruit': ['Apple', np.nan, 'Apple'],
        'food': ['Cake', 'Bread', np.nan]}

# Create DataFrame
food = pd.DataFrame(food)
   fruit   food
0  Apple   Cake
1    NaN  Bread
2  Apple    NaN
drink = {'smoothie': ['S_Strawberry', 'S_Watermelon', np.nan],
         'tea': ['T_white', np.nan, 'T_green']}

# Create DataFrame
drink = pd.DataFrame(drink)
       smoothie      tea
0  S_Strawberry  T_white
1  S_Watermelon      NaN
2           NaN  T_green
The rows represent specific customers.
I would like to make a co-occurrence matrix of food and drinks.
Expected outcome (columns and ids do not have to be in this order):
              Apple  Bread  Cake
S_Strawberry    1.0    NaN   1.0
S_Watermelon    NaN    1.0   NaN
T_white         1.0    NaN   1.0
T_green         1.0    NaN   NaN
So far I can make a co-occurrence matrix for each of the data frames, but I don't know how to bind the two together. Thank you.
I think you want pd.get_dummies and matrix multiplication:
pd.get_dummies(drink).T @ pd.get_dummies(food)
Output:
                       fruit_Apple  food_Bread  food_Cake
smoothie_S_Strawberry            1           0          1
smoothie_S_Watermelon            0           1          0
tea_T_green                      1           0          0
tea_T_white                      1           0          1
You can get rid of the prefixes with:
pd.get_dummies(drink, prefix='', prefix_sep='').T @ pd.get_dummies(food, prefix='', prefix_sep='')
Output:
              Apple  Bread  Cake
S_Strawberry      1      0     1
S_Watermelon      0      1     0
T_green           1      0     0
T_white           1      0     1
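The expected outcome in the question shows NaN where these outputs have 0; if that matters, the zeros can be replaced afterwards, e.g.:

out = pd.get_dummies(drink, prefix='', prefix_sep='').T @ pd.get_dummies(food, prefix='', prefix_sep='')
out = out.replace(0, np.nan)  # show 0 co-occurrences as NaN, as in the expected outcome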
I have several tables that contain lab results, with a 'master' table of sample data containing things like a description. The results tables are also broken down by specimen (sub-samples), and they contain multiple result columns; I'm just showing one here. I want to combine all the results tables into one dataframe, like this:
Table 1:
Location  Sample  Description
1         A       Yellow
1         B       Red
2         A       Blue
2         B       Violet
Table 2:
Location  Sample  Specimen  Result1
1         A       X         5
1         A       Y         6
1         B       X         10
2         A       X         1
Table 3:
Location  Sample  Specimen  Result2
1         A       X         "Heavy"
1         A       Q         "Soft"
1         B       K         "Grey"
2         B       Z         "Bananas"
Desired Output:
Location  Sample  Description  Specimen  Result1  Result2
1         A       Yellow       X         5        "Heavy"
1         A       Yellow       Y         6        nan
1         A       Yellow       Q         nan      "Soft"
1         B       Red          X         10       nan
1         B       Red          K         nan      "Grey"
2         A       Blue         X         1        nan
2         B       Violet       Z         nan      "Bananas"
I currently have a solution for this using iterrows() and df.append(), but both are slow operations, and when there are thousands of results it takes too long. Is there a better way? I have tried using join() and merge(), but I can't seem to get the result I want.
Quick code to reproduce my dataframes:
import pandas as pd

dict1 = {'Location': [1, 1, 2, 2], 'Sample': ['A', 'B', 'A', 'B'],
         'Description': ['Yellow', 'Red', 'Blue', 'Violet']}
dict2 = {'Location': [1, 1, 1, 2], 'Sample': ['A', 'A', 'B', 'A'],
         'Specimen': ['x', 'y', 'x', 'x'], 'Result1': [5, 6, 10, 1]}
dict3 = {'Location': [1, 1, 1, 2], 'Sample': ['A', 'A', 'B', 'B'],
         'Specimen': ['x', 'q', 'k', 'z'], 'Result2': ["Heavy", "Soft", "Grey", "Bananas"]}

df1 = pd.DataFrame.from_dict(dict1)
df2 = pd.DataFrame.from_dict(dict2)
df3 = pd.DataFrame.from_dict(dict3)
The first idea is to join df2 and df3 together with concat, aggregate by sum so that each unique 'Location','Sample','Specimen' combination becomes a single row, and finally merge to df1. (The Result2 values in the next two outputs come from an earlier, numeric version of the sample data; see the edit at the end for the string data.)
df23 = (pd.concat([df2, df3])
          .groupby(['Location','Sample','Specimen'], as_index=False, sort=False)
          .sum(min_count=1))
df = df1.merge(df23, on=['Location','Sample'])
print (df)
   Location Sample Description Specimen  Result1  Result2
0         1      A      Yellow        x      5.0      4.0
1         1      A      Yellow        y      6.0      NaN
2         1      A      Yellow        q      NaN      6.0
3         1      B         Red        x     10.0      NaN
4         1      B         Red        k      NaN      8.0
5         2      A        Blue        x      1.0      NaN
6         2      B      Violet        z      NaN      5.0
Or, if all rows in df2 and df3 are already unique per ['Location','Sample','Specimen'], the solution is simpler:
df23 = pd.concat([df2.set_index(['Location','Sample','Specimen']),
                  df3.set_index(['Location','Sample','Specimen'])], axis=1)
df = df1.merge(df23.reset_index(), on=['Location','Sample'])
print (df)
   Location Sample Description Specimen  Result1  Result2
0         1      A      Yellow        q      NaN      6.0
1         1      A      Yellow        x      5.0      4.0
2         1      A      Yellow        y      6.0      NaN
3         1      B         Red        k      NaN      8.0
4         1      B         Red        x     10.0      NaN
5         2      A        Blue        x      1.0      NaN
6         2      B      Violet        z      NaN      5.0
EDIT: With the new data, the second solution works well:
df23 = pd.concat([df2.set_index(['Location','Sample','Specimen']),
                  df3.set_index(['Location','Sample','Specimen'])], axis=1)
df = df1.merge(df23.reset_index(), on=['Location','Sample'])
print (df)
   Location Sample Description Specimen  Result1  Result2
0         1      A      Yellow        q      NaN     Soft
1         1      A      Yellow        x      5.0    Heavy
2         1      A      Yellow        y      6.0      NaN
3         1      B         Red        k      NaN     Grey
4         1      B         Red        x     10.0      NaN
5         2      A        Blue        x      1.0      NaN
6         2      B      Violet        z      NaN  Bananas
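For completeness, the same result can also be reached with merge alone, by outer-merging df2 and df3 on all three key columns first; a sketch, assuming ['Location','Sample','Specimen'] uniquely identifies rows in each results table:

# outer merge keeps specimens that appear in only one of the results tables
df23 = df2.merge(df3, on=['Location','Sample','Specimen'], how='outer')
df = df1.merge(df23, on=['Location','Sample'])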
I have a df with different dicts as entries in a column, in my case column "information". I would like to expand the df by all possible dict keys, something like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': pd.Series([1, 2, 3, 4, 5]),
                   'name': pd.Series(['banana', 'apple', 'orange', 'strawberry', 'toast']),
                   'information': pd.Series([{'shape': 'curve', 'color': 'yellow'},
                                             {'color': 'red'},
                                             {'shape': 'round'},
                                             {'amount': 500},
                                             np.nan]),
                   'cost': pd.Series([1, 2, 2, 10, 4])})
   id        name                            information  cost
0   1      banana  {'shape': 'curve', 'color': 'yellow'}     1
1   2       apple                       {'color': 'red'}     2
2   3      orange                     {'shape': 'round'}     2
3   4  strawberry                        {'amount': 500}    10
4   5       toast                                    NaN     4
It should look like this:
   id        name  shape   color  amount  cost
0   1      banana  curve  yellow     NaN     1
1   2       apple    NaN     red     NaN     2
2   3      orange  round     NaN     NaN     2
3   4  strawberry    NaN     NaN   500.0    10
4   5       toast    NaN     NaN     NaN     4
Another approach would be using pandas.DataFrame.from_records:
import pandas as pd
new = pd.DataFrame.from_records(df.pop('information').apply(lambda x: {} if pd.isna(x) else x))
new = pd.concat([df, new], axis=1)
print(new)
Output:
   cost  id        name  amount   color  shape
0     1   1      banana     NaN  yellow  curve
1     2   2       apple     NaN     red    NaN
2     2   3      orange     NaN     NaN  round
3    10   4  strawberry   500.0     NaN    NaN
4     4   5       toast     NaN     NaN    NaN
You can use:
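# v != v is True only for NaN, so missing entries are replaced with empty dicts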
d = {k: {} if v != v else v for k, v in df.pop('information').items()}
df1 = pd.DataFrame.from_dict(d, orient='index')
df = pd.concat([df, df1], axis=1)
print(df)
   id        name  cost  shape   color  amount
0   1      banana     1  curve  yellow     NaN
1   2       apple     2    NaN     red     NaN
2   3      orange     2  round     NaN     NaN
3   4  strawberry    10    NaN     NaN   500.0
4   5       toast     4    NaN     NaN     NaN
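pd.json_normalize (available since pandas 1.0) is another option once the NaN entries are replaced with empty dicts; a sketch of the same idea:

info = df.pop('information').apply(lambda x: {} if pd.isna(x) else x)
df = pd.concat([df, pd.json_normalize(info.tolist())], axis=1)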
Here is my data frame.
df
  store_1    store_2    store_3  store_4
0  banana     banana       plum   banana
1  orange  tangerine       pear   orange
2   apple       pear      melon    apple
3    pear  raspberry  pineapple     plum
4    plum     tomato      peach   tomato
I'm looking for a way to count the number of co-occurrences in the stores (to compare their similarity).
You can try something like this:
import itertools as it
import numpy as np
import pandas as pd

# overlap fraction: how many of store a's products also appear in store b
corr = lambda a, b: len(set(a).intersection(set(b))) / len(a)

# pairwise overlaps for every pair of store columns (upper triangle, diagonal included)
c = [corr(*x) for x in it.combinations_with_replacement(df.T.values.tolist(), 2)]

# rebuild the flat list into an upper-triangular matrix
j = 0
x = []
for i in range(4, 0, -1):  # replace 4 with df.shape[-1]
    x.append([np.nan] * (4 - i) + c[j:j + i])
    j += i

pd.DataFrame(x, columns=df.columns, index=df.columns)
Which yields
         store_1  store_2  store_3  store_4
store_1      1.0      0.4      0.4      0.8
store_2      NaN      1.0      0.2      0.4
store_3      NaN      NaN      1.0      0.2
store_4      NaN      NaN      NaN      1.0
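Each entry is the fraction of products the two stores share (intersection size divided by the five products per store); for example, store_1 and store_4 share four of their five products, hence 0.8.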
If you wish to estimate the similarity of the stores with regard to their products, you could use one-hot encoding. Each store is then described by a vector whose length n is the number of distinct products across all stores, with the products in a fixed order such as:

banana, orange, apple, pear, plum, tangerine, raspberry, tomato, melon, ...

Store_1 is then described as 1 1 1 1 1 0 0 0 0 ...
Store_2 as 1 0 0 1 0 1 1 1 0 ...

This leaves you with numerical vectors on which you can compute a dissimilarity measure such as Euclidean distance.
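A minimal sketch of this idea, assuming the df from the question (stores as columns) and that each product appears at most once per store:

import numpy as np
import pandas as pd

# one-hot store-by-product matrix: 1 if the store carries the product
long = df.melt(var_name='store', value_name='product')
onehot = pd.crosstab(long['store'], long['product'])

# pairwise Euclidean distances between the store vectors
v = onehot.to_numpy()
dist = np.sqrt(((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=2))
print(pd.DataFrame(dist, index=onehot.index, columns=onehot.index))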
I have a test df like this:
df = pd.DataFrame({'A': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Pears', 'Pears'],
                   'B': [1, 2, 9, 6, 4, 3, 2, 1]})
        A  B
0   Apple  1
1   Apple  2
2   Apple  9
3  Orange  6
4  Orange  4
5  Orange  3
6   Pears  2
7   Pears  1
Now I need to add a new column with the respective % differences in col 'B'. How is this possible? I cannot get this to work.
I have looked at
update column value of pandas groupby().last()
Not sure that it is pertinent to my problem.
And this, which looks promising:
Pandas Groupby and Sum Only One Column
I need to find the maximum percentage change in col 'B' per group of col 'A' and insert it into the col maxpercchng for all rows in the group.
So I have come up with this code:
grouppercchng = ((df.groupby['A'].max() - df.groupby['A'].min())/df.groupby['A'].iloc[0])*100
and try to add it to the group col 'maxpercchng' like so
group['maxpercchng'] = grouppercchng
Or like so
df_kpi_hot.groupby(['A'], as_index=False)['maxpercchng'] = grouppercchng
Does anyone know how to add the maxpercchng col to all rows in a group?
I believe you need transform, which returns a Series of the same size as the original DataFrame, filled with the aggregated values:
g = df.groupby('A')['B']
df['maxpercchng'] = (g.transform('max') - g.transform('min')) / g.transform('first') * 100
print (df)
        A  B  maxpercchng
0   Apple  1        800.0
1   Apple  2        800.0
2   Apple  9        800.0
3  Orange  6         50.0
4  Orange  4         50.0
5  Orange  3         50.0
6   Pears  2         50.0
7   Pears  1         50.0
Or:
g = df.groupby('A')['B']
df1 = ((g.max() - g.min()) / g.first() * 100).reset_index()
print (df1)
        A      B
0   Apple  800.0
1  Orange   50.0
2   Pears   50.0
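If you take the aggregated route but still need the value on every row, the per-group result can be merged back; a small sketch (the rename just gives the column its final name):

df = df.merge(df1.rename(columns={'B': 'maxpercchng'}), on='A')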
I need to combine three columns of categorical data into a single set of binary category-named columns. This is similar to a "one-hot" but the source rows have up to three categories instead of just one. Also, note that there are 100+ categories and I will not know them beforehand.
id, fruit1, fruit2, fruit3
1, apple, orange,
2, orange, ,
3, banana, apple,
should generate...
id, apple, banana, orange
1, 1, 0, 1
2, 0, 0, 1
3, 1, 1, 0
You could use pd.melt to combine all the fruit columns into one column, and then use pd.crosstab to create a frequency table:
import numpy as np
import pandas as pd
df = pd.read_csv('data')
df = df.replace(r' ', np.nan)
# id fruit1 fruit2 fruit3
# 0 1 apple orange NaN
# 1 2 orange NaN NaN
# 2 3 banana apple NaN
melted = pd.melt(df, id_vars=['id'])
result = pd.crosstab(melted['id'], melted['value'])
print(result)
yields
value  apple  banana  orange
id
1          1       0       1
2          0       0       1
3          1       1       0
Explanation: The melted DataFrame looks like this:
In [148]: melted = pd.melt(df, id_vars=['id']); melted
Out[148]:
   id variable   value
0   1   fruit1   apple
1   2   fruit1  orange
2   3   fruit1  banana
3   1   fruit2  orange
4   2   fruit2     NaN
5   3   fruit2   apple
6   1   fruit3     NaN
7   2   fruit3     NaN
8   3   fruit3     NaN
We can ignore the variable column; it is the id and value columns that are important.
pd.crosstab can be used to create a frequency table with melted['id'] values in the index and melted['value'] values as the columns:
In [150]: pd.crosstab(melted['id'], melted['value'])
Out[150]:
value  apple  banana  orange
id
1          1       0       1
2          0       0       1
3          1       1       0
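If the same fruit could appear more than once in a row and you need strictly binary columns rather than counts, the result can be capped:

result = result.clip(upper=1)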
You can apply value_counts to each row:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'fruit1': ['Apple', 'Banana', np.nan],
    'fruit2': ['Banana', np.nan, 'Apple'],
    'fruit3': ['Grape', np.nan, np.nan],
})

df = df.apply(lambda row: row.value_counts(), axis=1).fillna(0).astype(int)
Before:
   fruit1  fruit2 fruit3
0   Apple  Banana  Grape
1  Banana     NaN    NaN
2     NaN   Apple    NaN
After:
   Apple  Banana  Grape
0      1       1      1
1      0       1      0
2      1       0      0
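An equivalent one-liner, as a sketch: stack drops the NaN entries, get_dummies encodes what remains, and collapsing back to the first index level restores one row per original record:

df = pd.get_dummies(df.stack()).groupby(level=0).max()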