I want to assign some selected nominal values randomly to rows. For example:
I have three nominal values ["apple", "orange", "banana"].
Before assigning these values randomly to rows:
**Name Fruit**
Jack
Julie
Juana
Jenny
Christina
Dickens
Robert
Cersei
After assigning these values randomly to rows:
**Name Fruit**
Jack Apple
Julie Orange
Juana Apple
Jenny Banana
Christina Orange
Dickens Orange
Robert Apple
Cersei Banana
How can I do this with a pandas DataFrame?
You can use np.random.choice with your values (the pd.np alias was removed in pandas 2.0, so import numpy directly):
import numpy as np
vals = ["apple", "orange", "banana"]
df['Fruit'] = np.random.choice(vals, len(df))
>>> df
Name Fruit
0 Jack apple
1 Julie orange
2 Juana apple
3 Jenny orange
4 Christina apple
5 Dickens banana
6 Robert orange
7 Cersei orange
You can create a DataFrame in pandas and then assign random choices using numpy:
import numpy as np
import pandas as pd
ex2 = pd.DataFrame({'Name': ['Jack', 'Julie', 'Juana', 'Jenny', 'Christina', 'Dickens', 'Robert', 'Cersei']})
ex2['Fruits'] = np.random.choice(['Apple', 'Orange', 'Banana'], ex2.shape[0])
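If the assignment needs to be reproducible, a seeded NumPy Generator may be more convenient than the global np.random state. The following is a minimal sketch; the seed value 42 and the variable name rng are arbitrary choices for illustration and do not come from the answers above.

```python
import numpy as np
import pandas as pd

names = ['Jack', 'Julie', 'Juana', 'Jenny', 'Christina', 'Dickens', 'Robert', 'Cersei']
vals = ['apple', 'orange', 'banana']

df = pd.DataFrame({'Name': names})

# Seeded generator: the same seed gives the same assignment on every run.
rng = np.random.default_rng(42)  # 42 is an arbitrary, illustrative seed

# Draw one value per row, with replacement, so values may repeat.
df['Fruit'] = rng.choice(vals, size=len(df))
print(df)
```

Because the draw is with replacement, there is no guarantee that every value appears or that the values are evenly spread across the rows.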
I have a dataframe structured like this:
User   Food 1    Food 2    Food 3    Food 4
Steph  Onions    Tomatoes  Cabbages  Potatoes
Tom    Potatoes  Tomatoes  Potatoes  Potatoes
Fred   Carrots   Cabbages            Eggplant
Phil   Onions    Eggplant  Eggplant
I want to use the distinct values from across the food columns as categories. I then want to create a Seaborn plot so the % of each category for each column is plotted as a 100% horizontal stacked bar.
My attempt to do this:
data = {
'User' : ['Steph', 'Tom', 'Fred', 'Phil'],
'Food 1' : ["Onions", "Potatoes", "Carrots", "Onions"],
'Food 2' : ['Tomatoes', 'Tomatoes', 'Cabbages', 'Eggplant'],
'Food 3' : ["Cabbages", "Potatoes", "", "Eggplant"],
'Food 4' : ['Potatoes', 'Potatoes', 'Eggplant', ''],
}
df = pd.DataFrame(data)
x_ax = ["Onions", "Potatoes", "Carrots", "Onions", "", 'Eggplant', "Cabbages"]
df.plot(kind="barh", x=x_ax, y=["Food 1", "Food 2", "Food 3", "Food 4"], stacked=True, ax=axes[1])
plt.show()
1. Replace '' with np.nan, because empty strings would otherwise be counted as values.
2. Use pandas.DataFrame.melt to convert the dataframe to long form.
3. Use pandas.crosstab with the normalize parameter to calculate the percent for each 'Food'.
4. Plot the dataframe with pandas.DataFrame.plot and kind='barh'.
   Putting the food names on the x-axis is not the correct way to create a 100% stacked bar plot. One axis must be numeric. The bars will be colored by food type.
5. Annotate the bars based on this answer.
6. Move the legend outside the plot based on this answer.
seaborn is a high-level API for matplotlib, and pandas uses matplotlib as the default backend, and it's easier to produce a stacked bar plot with pandas.
seaborn doesn't support stacked barplots, unless histplot is used in a hacked way, as shown in this answer, and would require an extra step of melting percent.
Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1
Assignment expressions (:=) require python >= 3.8. Otherwise, use [f'{v.get_width():.2f}%' if v.get_width() > 0 else '' for v in c ].
import pandas as pd
import numpy as np
# using the dataframe in the OP
# 1.
df = df.replace('', np.nan)
# 2.
dfm = df.melt(id_vars='User', var_name='Food', value_name='Type')
# 3.
percent = pd.crosstab(dfm.Food, dfm.Type, normalize='index').mul(100).round(2)
# 4.
ax = percent.plot(kind='barh', stacked=True, figsize=(8, 6))
# 5.
for c in ax.containers:
    # customize the label to account for cases when there might not be a bar section
    labels = [f'{w:.2f}%' if (w := v.get_width()) > 0 else '' for v in c]
    # set the bar label
    ax.bar_label(c, labels=labels, label_type='center')
# 6.
ax.legend(bbox_to_anchor=(1, 1.02), loc='upper left')
DataFrame Views
dfm
User Food Type
0 Steph Food 1 Onions
1 Tom Food 1 Potatoes
2 Fred Food 1 Carrots
3 Phil Food 1 Onions
4 Steph Food 2 Tomatoes
5 Tom Food 2 Tomatoes
6 Fred Food 2 Cabbages
7 Phil Food 2 Eggplant
8 Steph Food 3 Cabbages
9 Tom Food 3 Potatoes
10 Fred Food 3 NaN
11 Phil Food 3 Eggplant
12 Steph Food 4 Potatoes
13 Tom Food 4 Potatoes
14 Fred Food 4 Eggplant
15 Phil Food 4 NaN
ct
Type Cabbages Carrots Eggplant Onions Potatoes Tomatoes
Food
Food 1 0 1 0 2 1 0
Food 2 1 0 1 0 0 2
Food 3 1 0 1 0 1 0
Food 4 0 0 1 0 2 0
total
Food
Food 1 4
Food 2 4
Food 3 3
Food 4 3
dtype: int64
percent
Type Cabbages Carrots Eggplant Onions Potatoes Tomatoes
Food
Food 1 0.00 25.0 0.00 50.0 25.00 0.0
Food 2 25.00 0.0 25.00 0.0 0.00 50.0
Food 3 33.33 0.0 33.33 0.0 33.33 0.0
Food 4 0.00 0.0 33.33 0.0 66.67 0.0
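The ct and total views shown above are not produced by the code in the answer. One way they might be computed is sketched below, assuming ct is the raw crosstab of counts and total is its row sum. Dividing ct by total should then give the same percentages as normalize='index'.

```python
import numpy as np
import pandas as pd

data = {
    'User': ['Steph', 'Tom', 'Fred', 'Phil'],
    'Food 1': ['Onions', 'Potatoes', 'Carrots', 'Onions'],
    'Food 2': ['Tomatoes', 'Tomatoes', 'Cabbages', 'Eggplant'],
    'Food 3': ['Cabbages', 'Potatoes', '', 'Eggplant'],
    'Food 4': ['Potatoes', 'Potatoes', 'Eggplant', ''],
}
df = pd.DataFrame(data).replace('', np.nan)

# Long form: one row per (User, Food column, Type) combination.
dfm = df.melt(id_vars='User', var_name='Food', value_name='Type')

# Raw counts per food column; missing values are dropped by crosstab.
ct = pd.crosstab(dfm['Food'], dfm['Type'])

# Number of non-missing entries in each food column.
total = ct.sum(axis=1)

# Row-wise percentages, computed by hand instead of normalize='index'.
percent = ct.div(total, axis=0).mul(100).round(2)
print(percent)
```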
Let's say I have a dictionary of lists, e.g.:
favourite_icecreams = {
'Josh': ['vanilla', 'banana'],
'Greg': ['chocolate'],
'Sarah': ['mint', 'vanilla', 'mango']
}
I want to convert it to a pandas dataframe, with columns as "Flavour" and "Person". It should look like this:
Flavour    Person
vanilla    Josh
banana     Josh
chocolate  Greg
mint       Sarah
vanilla    Sarah
mango      Sarah
What's the most efficient way to do this?
Another solution, using .explode():
df = pd.DataFrame(
    {
        "Person": favourite_icecreams.keys(),
        "Flavour": favourite_icecreams.values(),
    }
).explode("Flavour")
print(df)
Prints:
Person Flavour
0 Josh vanilla
0 Josh banana
1 Greg chocolate
2 Sarah mint
2 Sarah vanilla
2 Sarah mango
You can use (generator) comprehension and then feed it to pd.DataFrame:
import pandas as pd
favourite_icecreams = {
'Josh': ['vanilla', 'banana'],
'Greg': ['chocolate'],
'Sarah': ['mint', 'vanilla', 'mango']
}
data = ((flavour, person)
for person, flavours in favourite_icecreams.items()
for flavour in flavours)
df = pd.DataFrame(data, columns=('Flavour', 'Person'))
print(df)
# Flavour Person
# 0 vanilla Josh
# 1 banana Josh
# 2 chocolate Greg
# 3 mint Sarah
# 4 vanilla Sarah
# 5 mango Sarah
You can do this purely in pandas like below using DataFrame.from_dict and df.stack:
In [453]: df = pd.DataFrame.from_dict(favourite_icecreams, orient='index').stack().reset_index().drop(columns='level_1')
In [455]: df.columns = ['Person', 'Flavour']
In [456]: df
Out[456]:
Person Flavour
0 Josh vanilla
1 Josh banana
2 Greg chocolate
3 Sarah mint
4 Sarah vanilla
5 Sarah mango
One option is to extract person and flavour into separate lists, use numpy repeat on the person list, and finally create the DataFrame:
from itertools import chain
import numpy as np
import pandas as pd
person, flavour = zip(*favourite_icecreams.items())
lengths = list(map(len, flavour))
person = np.array(person).repeat(lengths)
flavour = list(chain.from_iterable(flavour))
pd.DataFrame({'person': person, 'flavour': flavour})
person flavour
0 Josh vanilla
1 Josh banana
2 Greg chocolate
3 Sarah mint
4 Sarah vanilla
5 Sarah mango
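As a quick sanity check, the sketch below compares two of the approaches above (.explode() and a comprehension) on the sample dictionary. The variable names exploded, pairs and same are illustrative only, and the index is reset after .explode() because that method repeats the original row labels.

```python
import pandas as pd

favourite_icecreams = {
    'Josh': ['vanilla', 'banana'],
    'Greg': ['chocolate'],
    'Sarah': ['mint', 'vanilla', 'mango'],
}

# Approach 1: one row per person, then explode the list column.
exploded = (
    pd.DataFrame({
        'Person': list(favourite_icecreams.keys()),
        'Flavour': list(favourite_icecreams.values()),
    })
    .explode('Flavour')
    .reset_index(drop=True)  # explode keeps the repeated original index
)

# Approach 2: build (person, flavour) tuples directly.
pairs = pd.DataFrame(
    [(person, flavour)
     for person, flavours in favourite_icecreams.items()
     for flavour in flavours],
    columns=['Person', 'Flavour'],
)

# Cast to str so the comparison does not depend on the inferred dtypes.
same = exploded.astype(str).equals(pairs.astype(str))
print(same)
```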
I want to create a network in R. I have a dataframe that looks like this. Say Alex has an apple and a banana, Brian has two apples and a peach, and John has...
Alex Apple
Alex Banana
Alex Kiwi
Brian Apple
Brian Apple
Brian Peach
John Kiwi
John Peach
John Banana
Chris Melon
Chris Apple
...
I want to use this dataframe to create an undirected network that uses fruits as nodes. If one person has two different fruits, say John has a peach and a kiwi, then there is an edge between the peach and kiwi nodes. The weight of the edge is the number of people who have both of these fruits (nodes).
I'm thinking about creating an adjacency matrix first, but I don't know how to do it. If you have a better idea for creating a different network based on this dataframe, please give me a hint.
Since OP does not have a desired output, assuming that dupes are to be removed, here is an option using combn in data.table:
edges <- unique(DT)[, if (.N > 1L) transpose(combn(Fruit, 2L, simplify=FALSE)), Person][,
.N, .(V1, V2)]
library(igraph)
g <- graph_from_data_frame(edges)
g <- set_edge_attr(g, "weight", value=edges$N)  # assign the result; set_edge_attr returns a modified copy
plot(g)
#to check weights, use get.data.frame(g)
edges:
V1 V2 N
1: Apple Banana 1
2: Apple Kiwi 1
3: Banana Kiwi 1
4: Apple Peach 1
5: Kiwi Peach 1
6: Kiwi Banana 1
7: Peach Banana 1
8: Melon Apple 1
data:
library(data.table)
DT <- fread("Person Fruit
Alex Apple
Alex Banana
Alex Kiwi
Brian Apple
Brian Apple
Brian Peach
John Kiwi
John Peach
John Banana
Chris Melon
Chris Apple
Andrew Apple")
I am trying to change a column's value when certain strings appear in another column of the same row. I am new to pandas.
I need to change the price of some oranges to 200, but not the price of 'Red Orange'. I cannot change the names in "fruits". The real strings are much longer; I shortened them here for convenience.
fruits price
Green apple from us 10
Orange Apple from US 11
Mango from Canada 15
Blue Orange from Mexico 16
Red Orange from Costa 15
Pink Orange from Brazil 19
Yellow Pear from Guatemala 32
Black Melon from Guatemala 4
Purple orange from Honduras 5
so that the final result would be
fruits price
Green apple from us 10
Orange Apple from US 11
Mango from Canada 15
Blue Orange from Mexico 200
Red Orange from Costa 15
Pink Orange from Brazil 200
Yellow Pear from Guatemala 32
Black Melon from Guatemala 4
Purple orange from Honduras 5
I tried
df.loc[df['fruits'].str.lower().str.contains('orange'), 'price'] = 200
But this changes the price of 4 items in total instead of only 2.
I also tried a for loop once, and that changed the price for the entire column.
You can use regex:
import re
df.loc[df['fruits'].str.lower().str.contains(r'(?<!red) orange', regex = True), 'price'] = 200
(?<!red) is a negative lookbehind: the pattern does not match when " orange" is immediately preceded by "red". The mandatory space before "orange" also ensures that it is not the first word of the string, so you won't have to worry about "orange" being the colour that describes something else (as in "Orange Apple").
df.loc[df['fruits'].str.contains('orange', case=False) & ~df['fruits'].str.contains('red', case=False), 'price'] = 200
We check for "orange" and use ~ to confirm that "red" is not present in the string. If both conditions are true, the price is changed to 200.
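To see which rows a pattern selects before overwriting any prices, it can help to build the boolean mask first and inspect it. The sketch below reuses the lookbehind pattern from the regex answer on the sample data; the names mask and matched are illustrative. On this data it selects three rows, including 'Purple orange from Honduras', which the expected output in the question leaves at 5, so the pattern may need adjusting depending on what was intended there.

```python
import pandas as pd

df = pd.DataFrame({
    'fruits': [
        'Green apple from us',
        'Orange Apple from US',
        'Mango from Canada',
        'Blue Orange from Mexico',
        'Red Orange from Costa',
        'Pink Orange from Brazil',
        'Yellow Pear from Guatemala',
        'Black Melon from Guatemala',
        'Purple orange from Honduras',
    ],
    'price': [10, 11, 15, 16, 15, 19, 32, 4, 5],
})

# Same pattern as the regex answer: " orange" not immediately preceded by "red".
mask = df['fruits'].str.lower().str.contains(r'(?<!red) orange', regex=True)

# Inspect the matching rows before changing anything.
matched = df.loc[mask, 'fruits'].tolist()
print(matched)

# Apply the update only to the selected rows.
df.loc[mask, 'price'] = 200
print(df)
```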
I have a dataset structurally similar to the one created below. Imagine each user brought a bag with the corresponding fruit. I want to count all pairwise combinations (not permutations) of fruit options, and use them to generate a probability that a user owns the bag after pulling two fruits out of it. There is an assumption that no user ever brings two of the same fruit.
import pandas as pd
df = pd.DataFrame({'user':['Matt', 'Matt', 'Matt', 'Matt', 'Tom', 'Tom', 'Tom', 'Tom', 'Nick', 'Nick', 'Nick', 'Nick', 'Nick'], 'fruit': ['Plum', 'Apple', 'Orange', 'Pear', 'Grape', 'Apple', 'Orange', 'Banana', 'Orange', 'Grape', 'Apple', 'Banana', 'Tomato']})[['user', 'fruit']]
print(df)
My thought was to merge the dataframe back onto itself on user, and generate counts based on unique pairs of fruit_x and fruit_y.
df_merged = df.merge(df, how='inner', on='user')
print(df_merged)
Unfortunately the merge yields two types of unwanted results. Instances where a fruit has been merged back onto itself are easy to fix.
df_fix1 = df_merged.query('fruit_x != fruit_y').copy()  # copy so new columns can be added without a SettingWithCopyWarning
gb_pair_user = df_fix1.groupby(['user', 'fruit_x', 'fruit_y'])
df_fix1['pair_user_count'] = gb_pair_user['user'].transform('count')
gb_pair = df_fix1.groupby(['fruit_x', 'fruit_y'])
df_fix1['pair_count'] = gb_pair['user'].transform('count')
df_fix1['probability'] = df_fix1['pair_user_count'] / df_fix1['pair_count'] * 1.0
print(df_fix1[['fruit_x', 'fruit_y', 'probability', 'user']])
The second type is where I'm stuck. There is no meaningful difference between Apple+Orange and Orange+Apple, so I'd like to remove one of those rows. If there is a way to get proper combinations, I'd be very interested in that, otherwise if anyone can suggest a hack to eliminate the duplicated information that would be great too.
You can take advantage of combinations from itertools to create the unique pairs of fruits for each user.
from itertools import combinations
def func(group):
    return pd.DataFrame(list(combinations(group.fruit, 2)), columns=['fruit_x', 'fruit_y'])
df.groupby('user').apply(func).reset_index(level=1, drop=True)
fruit_x fruit_y
user
Matt Plum Apple
Matt Plum Orange
Matt Plum Pear
Matt Apple Orange
Matt Apple Pear
Matt Orange Pear
Nick Orange Grape
Nick Orange Apple
Nick Orange Banana
Nick Orange Tomato
Nick Grape Apple
Nick Grape Banana
Nick Grape Tomato
Nick Apple Banana
Nick Apple Tomato
Nick Banana Tomato
Tom Grape Apple
Tom Grape Orange
Tom Grape Banana
Tom Apple Orange
Tom Apple Banana
Tom Orange Banana
You can then calculate the probability according to your program logic.
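One possible way to finish the calculation is sketched below. It assumes, as the question states, that no user brings the same fruit twice, so the probability for a user is one divided by the number of users whose bag contains the drawn pair. It also sorts each user's fruits before pairing, so that Apple+Orange and Orange+Apple fall into the same group; that sorting step is an addition and not part of the answer above. The column names pair_count and probability follow the question.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'user': ['Matt'] * 4 + ['Tom'] * 4 + ['Nick'] * 5,
    'fruit': ['Plum', 'Apple', 'Orange', 'Pear',
              'Grape', 'Apple', 'Orange', 'Banana',
              'Orange', 'Grape', 'Apple', 'Banana', 'Tomato'],
})

# One row per (user, unordered pair); sorting makes the pair order canonical.
rows = [
    (user, fruit_x, fruit_y)
    for user, group in df.groupby('user')
    for fruit_x, fruit_y in combinations(sorted(group['fruit']), 2)
]
pairs = pd.DataFrame(rows, columns=['user', 'fruit_x', 'fruit_y'])

# Number of users whose bag contains each pair.
pairs['pair_count'] = pairs.groupby(['fruit_x', 'fruit_y'])['user'].transform('count')

# Each user holds a given pair at most once, so their share is 1 / pair_count.
pairs['probability'] = 1 / pairs['pair_count']
print(pairs)
```

For example, all three users have both an apple and an orange, so each gets a probability of one third for that pair, while only Matt has a plum, so any pair containing a plum points to him with probability 1.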