How can I convert a dict of arrays into a 'flattened' dataframe? - python

Let's say I had a dictionary of arrays, eg:
favourite_icecreams = {
    'Josh': ['vanilla', 'banana'],
    'Greg': ['chocolate'],
    'Sarah': ['mint', 'vanilla', 'mango']
}
I want to convert it to a pandas dataframe, with columns as "Flavour" and "Person". It should look like this:
Flavour    Person
vanilla    Josh
banana     Josh
chocolate  Greg
mint       Sarah
vanilla    Sarah
mango      Sarah
What's the most efficient way to do this?

Another solution, using .explode():
import pandas as pd
df = pd.DataFrame(
    {
        "Person": list(favourite_icecreams.keys()),
        "Flavour": list(favourite_icecreams.values()),
    }
).explode("Flavour")
print(df)
Prints:
Person Flavour
0 Josh vanilla
0 Josh banana
1 Greg chocolate
2 Sarah mint
2 Sarah vanilla
2 Sarah mango
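Note that explode repeats the original row label for every element, which is why the index above reads 0, 0, 1, 2, 2, 2. A minimal sketch, assuming you want a fresh 0..n-1 index and the Flavour column first, as in the question (the name flat is only illustrative):

```python
import pandas as pd

favourite_icecreams = {
    'Josh': ['vanilla', 'banana'],
    'Greg': ['chocolate'],
    'Sarah': ['mint', 'vanilla', 'mango'],
}

# Build one row per person, then expand each list into one row per flavour.
flat = (
    pd.DataFrame({
        "Person": list(favourite_icecreams.keys()),
        "Flavour": list(favourite_icecreams.values()),
    })
    .explode("Flavour")
    .reset_index(drop=True)            # replace the repeated labels with 0..n-1
    .loc[:, ["Flavour", "Person"]]     # reorder columns to match the question
)
print(flat)
```

Resetting the index is optional; it only matters if later code relies on unique row labels.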

You can use (generator) comprehension and then feed it to pd.DataFrame:
import pandas as pd
favourite_icecreams = {
    'Josh': ['vanilla', 'banana'],
    'Greg': ['chocolate'],
    'Sarah': ['mint', 'vanilla', 'mango']
}
data = ((flavour, person)
        for person, flavours in favourite_icecreams.items()
        for flavour in flavours)
df = pd.DataFrame(data, columns=('Flavour', 'Person'))
print(df)
# Flavour Person
# 0 vanilla Josh
# 1 banana Josh
# 2 chocolate Greg
# 3 mint Sarah
# 4 vanilla Sarah
# 5 mango Sarah

You can do this purely in pandas like below using DataFrame.from_dict and df.stack:
In [453]: df = pd.DataFrame.from_dict(favourite_icecreams, orient='index').stack().reset_index().drop(columns='level_1')
In [455]: df.columns = ['Person', 'Flavour']
In [456]: df
Out[456]:
Person Flavour
0 Josh vanilla
1 Josh banana
2 Greg chocolate
3 Sarah mint
4 Sarah vanilla
5 Sarah mango

One option is to extract person and flavour into separate lists, use numpy repeat on the person list, and finally create the DataFrame:
from itertools import chain
import numpy as np
import pandas as pd
person, flavour = zip(*favourite_icecreams.items())
lengths = list(map(len, flavour))
person = np.array(person).repeat(lengths)
flavour = list(chain.from_iterable(flavour))
pd.DataFrame({'person': person, 'flavour': flavour})
person flavour
0 Josh vanilla
1 Josh banana
2 Greg chocolate
3 Sarah mint
4 Sarah vanilla
5 Sarah mango

Related

How do I fill a string column using a set in Pandas dataframe?

I have a huge dataset with two specific columns for Sales Person and Manager. I want to make a new column that assigns sales person names on a different basis.
Let's say that under manager John I have 4 executives: A, B, C, D.
I want to replace the existing sales persons under John with the executives A, B, C and D in sequence.
Here is what I want to do:
Input-
ID    SalesPerson  Sales Manager
AM12  Oliver       Bren
AM21  Athreyu      John
AM31  Margarita    Fer
AM41  Jenny        Fer
AM66  Omar         John
AM81  Michael      Nati
AM77  Orlan        John
AM87  Erika        Nateran
AM27  Jesus        John
AM69  Randy        John
Output -
ID    SalesPerson  Sales Manager  SalesPerson_new
AM12  Oliver       Bren           oliver
AM21  Athreyu      John           A
AM31  Margarita    Fer            Margarita
AM41  Jenny        Fer            Jenny
AM66  Omar         John           B
AM81  Michael      Nati           Michael
AM77  Orlan        John           C
AM87  Erika        Nateran        Nateran
AM27  Jesus        John           D
AM69  Randy        John           A
We can do this with cumcount and .map.
First we need to build a dictionary that repeats A, B, C, D cyclically,
i.e. {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'A', ...}
We can do this with a helper function and some handy functions from the itertools library.
from itertools import cycle, zip_longest, islice
from string import ascii_uppercase
import pandas as pd
import numpy as np
def repeatlist(it, count):
    return islice(cycle(it), count)
mapper = dict(zip_longest(range(50), repeatlist(ascii_uppercase[:4], 50)))
df['SalesPersonNew'] = np.where(
    df['Sales Manager'].eq('John'),
    df.groupby('Sales Manager')['SalesPerson'].cumcount().map(mapper),
    df['SalesPerson'])
print(df)
ID SalesPerson Sales Manager SalesPersonNew
0 AM12 Oliver Bren Oliver
1 AM21 Athreyu John A
2 AM31 Margarita Fer Margarita
3 AM41 Jenny Fer Jenny
4 AM66 Omar John B
5 AM81 Michael Nati Michael
6 AM77 Orlan John C
7 AM87 Erika Nateran Erika
8 AM27 Jesus John D
9 AM69 Randy John A
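The lookup dictionary can also be avoided by taking the running count modulo the number of labels. A minimal sketch, assuming the sample data from the question; the names labels, position, cycled and is_john are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["AM12", "AM21", "AM31", "AM41", "AM66",
           "AM81", "AM77", "AM87", "AM27", "AM69"],
    "SalesPerson": ["Oliver", "Athreyu", "Margarita", "Jenny", "Omar",
                    "Michael", "Orlan", "Erika", "Jesus", "Randy"],
    "Sales Manager": ["Bren", "John", "Fer", "Fer", "John",
                      "Nati", "John", "Nateran", "John", "John"],
})

labels = np.array(["A", "B", "C", "D"])

# Position of each row within its manager's group: 0, 1, 2, ...
position = df.groupby("Sales Manager").cumcount()

# Wrap the position around the label list so it repeats A, B, C, D, A, ...
cycled = labels[position.to_numpy() % len(labels)]

# Use the cycled label for John's rows and keep the original name elsewhere.
is_john = df["Sales Manager"].eq("John")
df["SalesPersonNew"] = np.where(is_john, cycled, df["SalesPerson"])
print(df)
```

Because the modulo wraps automatically, this variant does not need an upper bound such as the 50 used for the mapper.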
Let's say that your dataframe is the variable df.
First, create the new column on your dataframe, initialized with the values already present in the SalesPerson column.
df["SalesPerson_new"] = df["SalesPerson"]
Then select the rows where the value of Sales Manager is John, and use that selection to update only the SalesPerson_new column.
mask = df["Sales Manager"] == "John"
number_of_rows = mask.sum()
df.loc[mask, "SalesPerson_new"] = ["A", "B", "C", "D"][:number_of_rows]
It is important to note that this works only if the list ["A", "B", "C", "D"] is at least as long as the number of selected rows. In the example data John has five rows, so the list would have to be extended (or repeated) first.
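If there may be more matching rows than labels, the list can be cycled instead of sliced. A sketch under that assumption, using a shortened version of the question's data; mask and labels are illustrative names:

```python
from itertools import cycle, islice

import pandas as pd

df = pd.DataFrame({
    "SalesPerson": ["Oliver", "Athreyu", "Omar", "Orlan", "Jesus", "Randy"],
    "Sales Manager": ["Bren", "John", "John", "John", "John", "John"],
})

labels = ["A", "B", "C", "D"]
mask = df["Sales Manager"] == "John"

# Start from a copy of the existing names, then overwrite only John's rows.
df["SalesPerson_new"] = df["SalesPerson"]

# islice(cycle(...), n) yields exactly n labels, wrapping around as needed.
df.loc[mask, "SalesPerson_new"] = list(islice(cycle(labels), int(mask.sum())))
print(df)
```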

How to create an adjacency matrix by counting number of co-appearance in a dataframe?

I want to create a network in R. I have a dataframe that looks like this. Say Alex has an apple and a banana, Brian has two apples and a peach, and John has...
Alex Apple
Alex Banana
Alex Kiwi
Brian Apple
Brian Apple
Brian Peach
John Kiwi
John Peach
John Banana
Chris Melon
Chris Apple
...
I want to use this dataframe to create an undirected network that uses fruits as nodes. If one person has two different fruits, say John has a peach and a kiwi, then there is an edge between the nodes peach and kiwi. The weight of the edge is the number of people who have both of these fruits (nodes).
I'm thinking about creating an adjacency matrix first, but don't know how to do it. If you have a better idea for creating a different network based on this dataframe, please give me a hint.
Since OP does not have a desired output, assuming that dupes are to be removed, here is an option using combn in data.table:
edges <- unique(DT)[, if (.N > 1L) transpose(combn(Fruit, 2L, simplify=FALSE)), Person][,
.N, .(V1, V2)]
library(igraph)
g <- graph_from_data_frame(edges)
g <- set_edge_attr(g, "weight", value=edges$N)
plot(g)
#to check weights, use get.data.frame(g)
edges:
V1 V2 N
1: Apple Banana 1
2: Apple Kiwi 1
3: Banana Kiwi 1
4: Apple Peach 1
5: Kiwi Peach 1
6: Kiwi Banana 1
7: Peach Banana 1
8: Melon Apple 1
data:
library(data.table)
DT <- fread("Person Fruit
Alex Apple
Alex Banana
Alex Kiwi
Brian Apple
Brian Apple
Brian Peach
John Kiwi
John Peach
John Banana
Chris Melon
Chris Apple
Andrew Apple")
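For readers working in pandas rather than R, the same idea can be sketched as a matrix product. This is an illustrative alternative, not the method used in the answer above: it assumes duplicate rows should count once, treats each pair of fruits as unordered, and uses the hypothetical names has_fruit and adjacency.

```python
import pandas as pd

df = pd.DataFrame({
    "Person": ["Alex", "Alex", "Alex", "Brian", "Brian", "Brian",
               "John", "John", "John", "Chris", "Chris", "Andrew"],
    "Fruit": ["Apple", "Banana", "Kiwi", "Apple", "Apple", "Peach",
              "Kiwi", "Peach", "Banana", "Melon", "Apple", "Apple"],
})

# Person x Fruit indicator table (1 if the person has the fruit at all).
has_fruit = pd.crosstab(df["Person"], df["Fruit"]).clip(upper=1)

# Fruit x Fruit co-occurrence: entry (i, j) counts people holding both.
adjacency = has_fruit.T @ has_fruit

# The diagonal counts owners of a single fruit, not an edge, so zero it out.
for fruit in adjacency.index:
    adjacency.loc[fruit, fruit] = 0
print(adjacency)
```

The result is symmetric, so an unordered pair such as Banana and Kiwi gets one combined weight rather than one entry per ordering.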

Pandas Get List of Unique Values in Column A for each Unique Value in Column B

I'm finding this problem easy to write out, but difficult to apply with my Pandas Dataframe.
When searching for anything 'unique values' and 'list' I only get answers for getting the unique values in a list.
There is a brute force solution with a double for loop, but there must be a faster Pandas solution than n^2.
I have a DataFrame with two columns: Name and Likes Food.
As output, I want a list of unique Likes Food values for each unique Name.
Example Dataframe df
Index Name Likes Food
0 Tim Pizza
1 Marie Pizza
2 Tim Pasta
3 Tim Pizza
4 John Pizza
5 Amy Pizza
6 Amy Sweet Potatoes
7 Marie Sushi
8 Tim Sushi
I know how to aggregate and groupby the unique count of Likes Food:
df = df.groupby( by='Name', as_index=False ).agg( {'Likes Food': pandas.Series.nunique} )
df = df.sort_values(by='Likes Food', ascending=False)
df = df.reset_index( drop=True )
>>>
Index Name Likes Food
0 Tim 3
1 Marie 2
2 Amy 2
3 John 1
But given that, what ARE the foods for each Name in that DataFrame? For readability, expressing them as a list makes good sense. The order within each list doesn't matter (and is probably easy to fix).
Example Output
<code here>
>>>
Index Name Likes Food Food List
0 Tim 3 [Pizza, Pasta, Sushi]
1 Marie 2 [Pizza, Sushi]
2 Amy 2 [Pizza, Sweet Potatoes]
3 John 1 [Pizza]
To obtain the output without the counts, just try unique (here the food column is named Likes):
df.groupby("Name")["Likes"].unique()
Name
Amy [Pizza, Sweet Potatoes]
John [Pizza]
Marie [Pizza, Sushi]
Tim [Pizza, Pasta, Sushi]
Name: Likes, dtype: object
Additionally, you can use named aggregation:
df.groupby("Name").agg(**{"Likes Food": pd.NamedAgg(column='Likes', aggfunc="nunique"),
    "Food List": pd.NamedAgg(column='Likes', aggfunc="unique")}).reset_index()
Name Likes Food Food List
0 Amy 2 [Pizza, Sweet Potatoes]
1 John 1 [Pizza]
2 Marie 2 [Pizza, Sushi]
3 Tim 3 [Pizza, Pasta, Sushi]
To get both columns, also sorted, try this:
df = df.groupby("Name")["Likes_Food"].agg(counts='nunique',
    food_list='unique').reset_index().sort_values(by='counts', ascending=False)
df
Name counts food_list
3 Tim 3 [Pizza, Pasta, Sushi]
0 Amy 2 [Pizza, SweetPotatoes]
2 Marie 2 [Pizza, Sushi]
1 John 1 [Pizza]
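The answers above use slightly different column names (Likes, Likes_Food). A self-contained sketch, assuming the question's column name Likes Food and a pandas version that supports keyword-style named aggregation (0.25 or later); summary is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tim", "Marie", "Tim", "Tim", "John", "Amy", "Amy", "Marie", "Tim"],
    "Likes Food": ["Pizza", "Pizza", "Pasta", "Pizza", "Pizza", "Pizza",
                   "Sweet Potatoes", "Sushi", "Sushi"],
})

summary = (
    df.groupby("Name")["Likes Food"]
      # counts = number of distinct foods, food_list = the distinct foods
      .agg(counts="nunique", food_list="unique")
      .reset_index()
      .sort_values("counts", ascending=False)
      .reset_index(drop=True)
)
print(summary)
```

Each food_list cell holds a NumPy array; apply list to the column (for example with .map(list)) if plain Python lists are needed.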

Assign specific nominal values randomly to rows using pandas

I want to assign some selected nominal values randomly to rows. For example:
I have three nominal values ["apple", "orange", "banana"].
Before assigning these values randomly to rows:
Name Fruit
Jack
Julie
Juana
Jenny
Christina
Dickens
Robert
Cersei
After assigning these values randomly to rows:
Name Fruit
Jack Apple
Julie Orange
Juana Apple
Jenny Banana
Christina Orange
Dickens Orange
Robert Apple
Cersei Banana
How can I do this using pandas dataframe?
You can use np.random.choice from NumPy with your values:
import numpy as np
vals = ["apple", "orange", "banana"]
df['Fruit'] = np.random.choice(vals, len(df))
>>> df
Name Fruit
0 Jack apple
1 Julie orange
2 Juana apple
3 Jenny orange
4 Christina apple
5 Dickens banana
6 Robert orange
7 Cersei orange
You can create a DataFrame in pandas and then assign random choices using numpy:
import numpy as np
import pandas as pd
ex2 = pd.DataFrame({'Name':['Jack','Julie','Juana','Jenny','Christina','Dickens','Robert','Cersei']})
ex2['Fruits'] = np.random.choice(['Apple','Orange','Banana'],ex2.shape[0])
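If the assignment needs to be repeatable between runs, NumPy's Generator interface can be seeded. A minimal sketch, assuming the names from the question; the seed value 0 and the variable rng are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Jack", "Julie", "Juana", "Jenny",
                            "Christina", "Dickens", "Robert", "Cersei"]})

vals = ["apple", "orange", "banana"]

# A seeded Generator makes the "random" assignment repeatable between runs.
rng = np.random.default_rng(seed=0)

# Draw one value per row, with replacement, each value equally likely.
df["Fruit"] = rng.choice(vals, size=len(df))
print(df)
```

Drop the seed argument to get a different assignment on every run.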

Using python and pandas to create combinations instead of permutations

I have a dataset structurally similar to the one created below. Imagine each user brought a bag with the corresponding fruit. I want to count all pairwise combinations (not permutations) of fruit options, and use them to generate a probability that a user owns the bag after pulling two fruits out of it. There is an assumption that no user ever brings two of the same fruit.
import pandas as pd
df = pd.DataFrame({'user':['Matt', 'Matt', 'Matt', 'Matt', 'Tom', 'Tom', 'Tom', 'Tom', 'Nick', 'Nick', 'Nick', 'Nick', 'Nick'], 'fruit': ['Plum', 'Apple', 'Orange', 'Pear', 'Grape', 'Apple', 'Orange', 'Banana', 'Orange', 'Grape', 'Apple', 'Banana', 'Tomato']})[['user', 'fruit']]
print(df)
My thought was to merge the dataframe back onto itself on user, and generate counts based on unique pairs of fruit_x and fruit_y.
df_merged = df.merge(df, how='inner', on='user')
print(df_merged)
Unfortunately the merge yields two types of unwanted results. Instances where a fruit has been merged back onto itself are easy to fix.
df_fix1 = df_merged.query('fruit_x != fruit_y').copy()
gb_pair_user = df_fix1.groupby(['user', 'fruit_x', 'fruit_y'])
df_fix1['pair_user_count'] = gb_pair_user['user'].transform('count')
gb_pair = df_fix1.groupby(['fruit_x', 'fruit_y'])
df_fix1['pair_count'] = gb_pair['user'].transform('count')
df_fix1['probability'] = df_fix1['pair_user_count'] / df_fix1['pair_count'] *1.0
print(df_fix1[['fruit_x', 'fruit_y', 'probability', 'user']])
The second type is where I'm stuck. There is no meaningful difference between Apple+Orange and Orange+Apple, so I'd like to remove one of those rows. If there is a way to get proper combinations, I'd be very interested in that, otherwise if anyone can suggest a hack to eliminate the duplicated information that would be great too.
You can take advantage of combinations from itertools to create the unique pairs of fruits for each user.
from itertools import combinations
def func(group):
    return pd.DataFrame(list(combinations(group.fruit, 2)), columns=['fruit_x', 'fruit_y'])
df.groupby('user').apply(func).reset_index(level=1, drop=True)
fruit_x fruit_y
user
Matt Plum Apple
Matt Plum Orange
Matt Plum Pear
Matt Apple Orange
Matt Apple Pear
Matt Orange Pear
Nick Orange Grape
Nick Orange Apple
Nick Orange Banana
Nick Orange Tomato
Nick Grape Apple
Nick Grape Banana
Nick Grape Tomato
Nick Apple Banana
Nick Apple Tomato
Nick Banana Tomato
Tom Grape Apple
Tom Grape Orange
Tom Grape Banana
Tom Apple Orange
Tom Apple Banana
Tom Orange Banana
You can then calculate the probability according to your program logic.
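One possible way to finish the calculation is sketched below. It assumes, as the question states, that no user brings the same fruit twice, and it sorts each user's fruits first so that Apple+Orange and Orange+Apple are recorded as the same pair across users. The names pairs and pair_count are illustrative only.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "user": ["Matt"] * 4 + ["Tom"] * 4 + ["Nick"] * 5,
    "fruit": ["Plum", "Apple", "Orange", "Pear",
              "Grape", "Apple", "Orange", "Banana",
              "Orange", "Grape", "Apple", "Banana", "Tomato"],
})

# One row per (user, unordered fruit pair); sorting gives every user the
# same ordering within a pair, so pairs can be matched across users.
pairs = pd.DataFrame(
    [(user, x, y)
     for user, fruits in df.groupby("user")["fruit"]
     for x, y in combinations(sorted(fruits), 2)],
    columns=["user", "fruit_x", "fruit_y"],
)

# Number of users whose bag contains each pair.
pair_count = pairs.groupby(["fruit_x", "fruit_y"])["user"].transform("count")

# Each user holds a given pair once, so the chance the bag is theirs is 1/count.
pairs["probability"] = 1.0 / pair_count
print(pairs)
```

For example, Apple and Orange appear in all three bags, so each user gets a probability of one third for that pair.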
