Using python and pandas to create combinations instead of permutations - python

I have a dataset structurally similar to the one created below. Imagine each user brought a bag with the corresponding fruit. I want to count all pairwise combinations (not permutations) of fruit options, and use them to generate a probability that a user owns the bag after pulling two fruits out of it. There is an assumption that no user ever brings two of the same fruit.
import pandas as pd
df = pd.DataFrame({'user':['Matt', 'Matt', 'Matt', 'Matt', 'Tom', 'Tom', 'Tom', 'Tom', 'Nick', 'Nick', 'Nick', 'Nick', 'Nick'], 'fruit': ['Plum', 'Apple', 'Orange', 'Pear', 'Grape', 'Apple', 'Orange', 'Banana', 'Orange', 'Grape', 'Apple', 'Banana', 'Tomato']})[['user', 'fruit']]
print(df)
My thought was to merge the dataframe back onto itself on user, and generate counts based on unique pairs of fruit_x and fruit_y.
df_merged = df.merge(df, how='inner', on='user')
print(df_merged)
Unfortunately the merge yields two types of unwanted results. Instances where a fruit has been merged back onto itself are easy to fix.
# .copy() so that adding columns below does not trigger SettingWithCopyWarning
df_fix1 = df_merged.query('fruit_x != fruit_y').copy()
gb_pair_user = df_fix1.groupby(['user', 'fruit_x', 'fruit_y'])
df_fix1['pair_user_count'] = gb_pair_user['user'].transform('count')
gb_pair = df_fix1.groupby(['fruit_x', 'fruit_y'])
df_fix1['pair_count'] = gb_pair['user'].transform('count')
df_fix1['probability'] = df_fix1['pair_user_count'] / df_fix1['pair_count']
print(df_fix1[['fruit_x', 'fruit_y', 'probability', 'user']])
The second type is where I'm stuck. There is no meaningful difference between Apple+Orange and Orange+Apple, so I'd like to remove one of those rows. If there is a way to get proper combinations, I'd be very interested in that, otherwise if anyone can suggest a hack to eliminate the duplicated information that would be great too.

You can use combinations from itertools to create the unique pairs of fruits for each user.
from itertools import combinations
def func(group):
    return pd.DataFrame(list(combinations(group.fruit, 2)), columns=['fruit_x', 'fruit_y'])
df.groupby('user').apply(func).reset_index(level=1, drop=True)
fruit_x fruit_y
user
Matt Plum Apple
Matt Plum Orange
Matt Plum Pear
Matt Apple Orange
Matt Apple Pear
Matt Orange Pear
Nick Orange Grape
Nick Orange Apple
Nick Orange Banana
Nick Orange Tomato
Nick Grape Apple
Nick Grape Banana
Nick Grape Tomato
Nick Apple Banana
Nick Apple Tomato
Nick Banana Tomato
Tom Grape Apple
Tom Grape Orange
Tom Grape Banana
Tom Apple Orange
Tom Apple Banana
Tom Orange Banana
You can then calculate the probability according to your program logic.
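As a minimal sketch of that last step (not necessarily what the answerer had in mind), the pairs can be put into a canonical order before counting. Note that combinations follows the row order of each user, so Matt yields (Apple, Orange) while Nick yields (Orange, Apple). Sorting each user's fruits first avoids that. The names `rows`, `pairs` and `alt` are only illustrative, and the probability here assumes, as the question states, that no user brings the same fruit twice.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'user': ['Matt'] * 4 + ['Tom'] * 4 + ['Nick'] * 5,
    'fruit': ['Plum', 'Apple', 'Orange', 'Pear',
              'Grape', 'Apple', 'Orange', 'Banana',
              'Orange', 'Grape', 'Apple', 'Banana', 'Tomato'],
})

# Sort each user's fruits so a pair is always emitted in alphabetical order.
rows = [
    (user, a, b)
    for user, fruits in df.groupby('user')['fruit']
    for a, b in combinations(sorted(fruits), 2)
]
pairs = pd.DataFrame(rows, columns=['user', 'fruit_x', 'fruit_y'])

# Each user holds a given pair at most once, so the chance that a pair
# belongs to a particular user is 1 / (number of users holding that pair).
pair_count = pairs.groupby(['fruit_x', 'fruit_y'])['user'].transform('count')
pairs['probability'] = 1.0 / pair_count

# Alternative that stays with the self-merge from the question:
# keeping only fruit_x < fruit_y drops self-pairs and mirrored pairs at once.
alt = df.merge(df, on='user').query('fruit_x < fruit_y')

print(pairs)
```

With the sample data, Apple+Orange is held by all three users, so each of those rows gets a probability of 1/3, while a pair such as Apple+Plum, which only Matt has, gets 1.0.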

Related

How can I convert a dict of arrays into a 'flattened' dataframe?

Let's say I had a dictionary of arrays, eg:
favourite_icecreams = {
'Josh': ['vanilla', 'banana'],
'Greg': ['chocolate'],
'Sarah': ['mint', 'vanilla', 'mango']
}
I want to convert it to a pandas dataframe, with columns as "Flavour" and "Person". It should look like this:
Flavour Person
vanilla Josh
banana Josh
chocolate Greg
mint Sarah
vanilla Sarah
mango Sarah
What's the most efficient way to do this?
Another solution, using .explode():
df = pd.DataFrame(
    {
        # list(...) makes the dict views plain lists, which every pandas version accepts
        "Person": list(favourite_icecreams.keys()),
        "Flavour": list(favourite_icecreams.values()),
    }
).explode("Flavour")
print(df)
Prints:
Person Flavour
0 Josh vanilla
0 Josh banana
1 Greg chocolate
2 Sarah mint
2 Sarah vanilla
2 Sarah mango
You can use (generator) comprehension and then feed it to pd.DataFrame:
import pandas as pd
favourite_icecreams = {
'Josh': ['vanilla', 'banana'],
'Greg': ['chocolate'],
'Sarah': ['mint', 'vanilla', 'mango']
}
data = ((flavour, person)
        for person, flavours in favourite_icecreams.items()
        for flavour in flavours)
df = pd.DataFrame(data, columns=('Flavour', 'Person'))
print(df)
# Flavour Person
# 0 vanilla Josh
# 1 banana Josh
# 2 chocolate Greg
# 3 mint Sarah
# 4 vanilla Sarah
# 5 mango Sarah
You can do this purely in pandas like below using DataFrame.from_dict and df.stack:
In [453]: df = pd.DataFrame.from_dict(favourite_icecreams, orient='index').stack().reset_index().drop(columns='level_1')
In [455]: df.columns = ['Person', 'Flavour']
In [456]: df
Out[456]:
Person Flavour
0 Josh vanilla
1 Josh banana
2 Greg chocolate
3 Sarah mint
4 Sarah vanilla
5 Sarah mango
One option is to extract person and flavour into separate lists, use numpy repeat on the person list, and finally create the DataFrame:
from itertools import chain
import numpy as np
import pandas as pd
person, flavour = zip(*favourite_icecreams.items())
lengths = list(map(len, flavour))
person = np.array(person).repeat(lengths)  # repeat each name once per flavour
flavour = list(chain.from_iterable(flavour))  # flatten the lists of flavours
pd.DataFrame({'person':person, 'flavour':flavour})
person flavour
0 Josh vanilla
1 Josh banana
2 Greg chocolate
3 Sarah mint
4 Sarah vanilla
5 Sarah mango

Use pandas (Python) to filter and combine multiple cells into one cell in Excel

I have a table with multiple columns and repeating data in all of the columns except one (Address).
Last Name First Name Food Address
Brown James Apple 1
Brown Duke Apple 2
William Sam Apple 3
Miller Karen Apple 4
William Barry Orange 5
William Sam Orange 6
Brown James Orange 7
Miller Karen Banana 8
Brown Terry Banana 9
I want to merge all first names sharing the same last name and food into one entry, and keep the first address found when that condition is met.
The result will look like this:
Does anyone know of any pandas (Python) functions that let me combine multiple cells into one? Also, what would be the best approach to solve this?
Thanks!
This should do the trick. There may be a faster way to put it all together, but in the end I pulled out the rows with repeated (Last Name, Food) groups, transformed them, and put them back with the non-repeated rows.
I added another repeating row to be sure it worked with more than just two repeating names.
import pandas as pd
d = {'Last Name': ['Brown', 'Brown', 'Brown', 'William','Miller', 'William', 'William','Brown', 'Miller', 'Brown'],
     'First Name':['Bill', 'James', 'Duke', 'Sam','Karen', 'Barry', 'Sam','James', 'Karen', 'Terry'],
     'Food': ['Apple', 'Apple', 'Apple', 'Apple','Apple', 'Orange', 'Orange','Orange', 'Banana', 'Banana'],
     'Address': [0, 1,2,3,4,5,6,7,8,9]}
df=pd.DataFrame(d)
grp_df = df.groupby(['Last Name', 'Food'])
df_nonrepeats = df[grp_df['First Name'].transform('count') == 1]
df_repeats = df[grp_df['First Name'].transform('count') > 1]
def concat_repeats(x):
    dff = x.copy()
    temp_list = ' '.join(dff['First Name'].tolist())
    dff['First Name'] = temp_list
    dff = dff.head(1)
    return dff
grp_df = df_repeats.groupby(['Last Name', 'Food'])
df_concats = grp_df.apply(lambda x: concat_repeats(x))
df_final = pd.concat([df_nonrepeats, df_concats[['Last Name', 'First Name', 'Food', 'Address']]]).sort_values('Address').reset_index(drop=True)
print (df_final)
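A shorter alternative, offered as a sketch rather than as the answer's own method, is to let a single groupby aggregation do both jobs: join the first names and keep the first address. It uses the table from the question; the `sort=False` and `as_index=False` choices are assumptions made here to keep the groups in order of first appearance and the keys as ordinary columns.

```python
import pandas as pd

df = pd.DataFrame({
    'Last Name': ['Brown', 'Brown', 'William', 'Miller', 'William',
                  'William', 'Brown', 'Miller', 'Brown'],
    'First Name': ['James', 'Duke', 'Sam', 'Karen', 'Barry',
                   'Sam', 'James', 'Karen', 'Terry'],
    'Food': ['Apple', 'Apple', 'Apple', 'Apple', 'Orange',
             'Orange', 'Orange', 'Banana', 'Banana'],
    'Address': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# One row per (Last Name, Food): first names joined with a space,
# and the first address seen for that group.
df_final = (
    df.groupby(['Last Name', 'Food'], as_index=False, sort=False)
      .agg({'First Name': ' '.join, 'Address': 'first'})
)
print(df_final)
```

Groups with a single row pass through unchanged, so there is no need to split the frame into repeated and non-repeated parts first.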

How to create an adjacency matrix by counting number of co-appearance in a dataframe?

I want to create a network in R. I have a dataframe that looks like this. Say Alex has an apple and a banana, Brian has two apples and a peach, and John has...
Alex Apple
Alex Banana
Alex Kiwi
Brian Apple
Brian Apple
Brian Peach
John Kiwi
John Peach
John Banana
Chris Melon
Chris Apple
...
I want to use this dataframe to create an undirected network that uses fruits as nodes. If one person has two different fruits, say John has a peach and a kiwi, then there is an edge between the peach node and the kiwi node. The weight of the edge is the number of people who have both of these fruits (nodes).
I'm thinking about creating an adjacency matrix first, but I don't know how to do it. If you have a better idea for building a network from this dataframe, please give me a hint.
Since OP does not have a desired output, assuming that dupes are to be removed, here is an option using combn in data.table:
edges <- unique(DT)[, if (.N > 1L) transpose(combn(Fruit, 2L, simplify=FALSE)), Person][,
.N, .(V1, V2)]
library(igraph)
g <- graph_from_data_frame(edges, directed=FALSE)
g <- set_edge_attr(g, "weight", value=edges$N)
plot(g)
#to check weights, use get.data.frame(g)
edges:
V1 V2 N
1: Apple Banana 1
2: Apple Kiwi 1
3: Banana Kiwi 1
4: Apple Peach 1
5: Kiwi Peach 1
6: Kiwi Banana 1
7: Peach Banana 1
8: Melon Apple 1
data:
library(data.table)
DT <- fread("Person Fruit
Alex Apple
Alex Banana
Alex Kiwi
Brian Apple
Brian Apple
Brian Peach
John Kiwi
John Peach
John Banana
Chris Melon
Chris Apple
Andrew Apple")
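For readers working in pandas rather than R, the adjacency matrix the question asks about can be sketched as a product of a person-by-fruit incidence table with its own transpose. This is only an illustration in a different language from the answer above. It uses the rows shown in the question, and the names `incidence`, `co_counts` and `adjacency` are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Person': ['Alex', 'Alex', 'Alex', 'Brian', 'Brian', 'Brian',
               'John', 'John', 'John', 'Chris', 'Chris'],
    'Fruit': ['Apple', 'Banana', 'Kiwi', 'Apple', 'Apple', 'Peach',
              'Kiwi', 'Peach', 'Banana', 'Melon', 'Apple'],
})

# Person-by-fruit table of 0/1 flags; clip so Brian's two apples count once.
incidence = pd.crosstab(df['Person'], df['Fruit']).clip(upper=1)

# Entry (i, j) is the number of people who have both fruit i and fruit j.
co_counts = incidence.T.dot(incidence)

# The diagonal only counts owners of a single fruit, so zero it out.
values = co_counts.to_numpy().copy()
np.fill_diagonal(values, 0)
adjacency = pd.DataFrame(values, index=co_counts.index, columns=co_counts.columns)

print(adjacency)
```

The result is symmetric, which matches an undirected network, and each off-diagonal entry can be used directly as an edge weight.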

Assign specific nominal values randomly to rows using pandas

I want to randomly assign values from a fixed set of nominal values to rows. For example:
I have three nominal values ["apple", "orange", "banana"].
Before assigning these values randomly to rows:
**Name Fruit**
Jack
Julie
Juana
Jenny
Christina
Dickens
Robert
Cersei
After assigning these values randomly to rows:
**Name Fruit**
Jack Apple
Julie Orange
Juana Apple
Jenny Banana
Christina Orange
Dickens Orange
Robert Apple
Cersei Banana
How can I do this using pandas dataframe?
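You can use np.random.choice with your values (the pd.np alias has been removed from recent pandas versions, so import numpy directly):
import numpy as np
vals = ["apple", "orange", "banana"]
df['Fruit'] = np.random.choice(vals, len(df))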
>>> df
Name Fruit
0 Jack apple
1 Julie orange
2 Juana apple
3 Jenny orange
4 Christina apple
5 Dickens banana
6 Robert orange
7 Cersei orange
You can create a DataFrame in pandas and then assign random choices using numpy:
import numpy as np
import pandas as pd
ex2 = pd.DataFrame({'Name':['Jack','Julie','Juana','Jenny','Christina','Dickens','Robert','Cersei']})
ex2['Fruits'] = np.random.choice(['Apple','Orange','Banana'],ex2.shape[0])
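If the assignment needs to be repeatable between runs, one option is NumPy's generator API with a fixed seed. The sketch below assumes a seed of 0 purely for illustration, and the names `rng` and `vals` are not from the answers above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Jack', 'Julie', 'Juana', 'Jenny',
                            'Christina', 'Dickens', 'Robert', 'Cersei']})
vals = ['apple', 'orange', 'banana']

# A seeded generator gives the same assignment every time the script runs.
rng = np.random.default_rng(seed=0)

# Draw one value per row, with replacement, so values can repeat.
df['Fruit'] = rng.choice(vals, size=len(df))
print(df)
```

Dropping the seed argument restores a different random assignment on every run.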

python pandas non-unique dict keys

I have an Excel file with data like this
Fruits Description
oranges This is an orange
apples This is an apple
oranges This is also oranges
plum this is a plum
plum this is also a plum
grape I can make some wine
grape make it red
I'm turning this into a dictionary using the below code
import pandas as pd
import xlrd
file = 'example.xlsx'
x1 = pd.ExcelFile(file)
print(x1.sheet_names)
df1 = x1.parse('Sheet1')
#print(df1)
print(df1.set_index('Fruits').T.to_dict('list'))
When I execute the above I get the warning
UserWarning: DataFrame columns are not unique, some columns will be omitted.
I want to have a dictionary that looks like the below
{'oranges': ['this is an orange', 'this is also oranges'], 'apples': ['this is an apple'],
'plum': ['This is a plum', 'this is also a plum'], 'grape': ['i can make some wine', 'make it red']}
How about this?
df.groupby(['Fruits'])['Description'].apply(list).to_dict()
{'apples': ['This is an apple'],
'grape': ['I can make some wine', 'make it red'],
'oranges': ['This is an orange', 'This is also oranges'],
'plum': ['this is a plum', 'this is also a plum']}
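The warning in the question arises because set_index('Fruits').T turns the repeated fruit names into duplicate column labels, and a dict can hold only one value per key. Grouping first collects every description under its fruit. Here is a self-contained sketch of the same idea, with the Excel sheet replaced by an inline DataFrame (an assumption made so that the example runs without example.xlsx):

```python
import pandas as pd

# Stand-in for x1.parse('Sheet1') from the question.
df1 = pd.DataFrame({
    'Fruits': ['oranges', 'apples', 'oranges', 'plum', 'plum', 'grape', 'grape'],
    'Description': ['This is an orange', 'This is an apple', 'This is also oranges',
                    'this is a plum', 'this is also a plum',
                    'I can make some wine', 'make it red'],
})

# One list of descriptions per fruit, in the original row order.
result = df1.groupby('Fruits')['Description'].apply(list).to_dict()
print(result)
```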
