Count value pairings from different columns in a DataFrame with Pandas - python

I have a df like this one:
df = pd.DataFrame([["coffee","soda","coffee","water","soda","soda"],["paper","glass","glass","paper","paper","glass"], list('smlssm')]).T
df.columns = ['item','cup','size']
df:
item cup size
0 coffee paper s
1 soda glass m
2 coffee glass l
3 water paper s
4 soda paper s
5 soda glass m
I want to transform this into a df that looks like this
item cup size freq
0 coffee paper s 1
1 coffee paper m 0
2 coffee paper l 0
3 coffee glass s 0
4 coffee glass m 0
5 coffee glass l 1
6 soda paper s 1
7 soda paper m 0
8 soda paper l 0
9 soda glass s 0
10 soda glass m 2
11 soda glass l 0
. . . . .
. . . . .
. . . . .
So for every item I want a row for each possible combination of cup and size, plus an additional column with the frequency.
What is the proper way to do this using pandas?

Try:
df["freq"] = 1
x = df.pivot_table(
    index="item",
    columns=["cup", "size"],
    values="freq",
    aggfunc="sum",
    fill_value=0,
)
# build the full (cup, size) product so missing combinations get columns too
full_cols = pd.MultiIndex.from_product(
    [
        x.columns.get_level_values(0).unique(),
        x.columns.get_level_values(1).unique(),
    ],
    names=x.columns.names,
)
x = x.reindex(full_cols, fill_value=0, axis=1)
# stack both column levels back into rows
print(x.stack([0, 1]).reset_index(name="freq"))
Prints:
item cup size freq
0 coffee glass l 1
1 coffee glass m 0
2 coffee glass s 0
3 coffee paper l 0
4 coffee paper m 0
5 coffee paper s 1
6 soda glass l 0
7 soda glass m 2
8 soda glass s 0
9 soda paper l 0
10 soda paper m 0
11 soda paper s 1
12 water glass l 0
13 water glass m 0
14 water glass s 0
15 water paper l 0
16 water paper m 0
17 water paper s 1
Dataframe used:
item cup size
0 coffee paper s
1 soda glass m
2 coffee glass l
3 water paper s
4 soda paper s
5 soda glass m

Let's try:
1. Add a frequency column to the dataframe to indicate that individual rows are worth 1 each.
2. groupby and sum to get the current counts in the DataFrame.
3. Create a MultiIndex from the unique values in each column.
4. Use the new midx to reindex with fill_value=0, so that freq is filled with 0 for combinations created by the new index.
5. reset_index to convert the index back into columns.
# Columns to reindex
idx_cols = ['item', 'cup', 'size']

# Create MultiIndex with unique values
midx = pd.MultiIndex.from_product(
    [df[c].unique() for c in idx_cols],
    names=idx_cols
)

df = (
    df.assign(freq=1)                    # Add freq column initialized to 1
      .groupby(idx_cols)['freq'].sum()   # Groupby and sum freq
      .reindex(midx, fill_value=0)       # Reindex against the full product
      .reset_index()                     # Convert the index back into columns
)
df:
item cup size freq
0 coffee paper s 1
1 coffee paper m 0
2 coffee paper l 0
3 coffee glass s 0
4 coffee glass m 0
5 coffee glass l 1
6 soda paper s 1
7 soda paper m 0
8 soda paper l 0
9 soda glass s 0
10 soda glass m 2
11 soda glass l 0
12 water paper s 1
13 water paper m 0
14 water paper l 0
15 water glass s 0
16 water glass m 0
17 water glass l 0

By using merge():
import itertools as it

dfa = df.groupby(['item', 'cup', 'size']).size().reset_index(name='freq')
dfb = pd.DataFrame(
    list(it.product(
        df['item'].unique(), df['cup'].unique(), df['size'].unique())),
    columns=dfa.columns[:-1])
dfa.merge(dfb, how='outer').fillna(0) \
   .sort_values(by=dfb.columns.to_list(), ascending=[True, True, False]) \
   .reset_index(drop=True).astype(int, errors='ignore')
item cup size freq
0 coffee glass s 0
1 coffee glass m 0
2 coffee glass l 1
3 coffee paper s 1
4 coffee paper m 0
5 coffee paper l 0
6 soda glass s 0
7 soda glass m 2
8 soda glass l 0
9 soda paper s 1
10 soda paper m 0
11 soda paper l 0
12 water glass s 0
13 water glass m 0
14 water glass l 0
15 water paper s 1
16 water paper m 0
17 water paper l 0
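A side note: on newer pandas versions astype(..., errors='ignore') is deprecated (an assumption worth checking against your version). Naming the chained result (out is a hypothetical name) and casting just the count column avoids it:
# 'out' is a hypothetical name for the chained result above
out = dfa.merge(dfb, how='outer').fillna(0) \
         .sort_values(by=dfb.columns.to_list(), ascending=[True, True, False]) \
         .reset_index(drop=True)
out['freq'] = out['freq'].astype(int)  # cast only the count column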

You can use another approach; check out itertools.product and itertools.combinations:
import itertools as it

# A copy of the original dataframe is needed to count the frequencies,
# because the df variable is modified inside the for loop.
df_cop = df.copy()
for index, (item, cup, size) in enumerate(it.product(df['item'].unique(), df['cup'].unique(), df['size'].unique())):
    row = list(it.combinations([item, cup, size], 3))[:3]
    df.loc[index, :'size'] = row[0]
    df.loc[index, 'freq'] = df_cop.values.tolist().count(list(row[0]))
print(df)
Output:
item cup size freq
0 coffee paper s 1
1 coffee paper m 0
2 coffee paper l 0
3 coffee glass s 0
4 coffee glass m 0
5 coffee glass l 1
6 soda paper s 1
7 soda paper m 0
8 soda paper l 0
9 soda glass s 0
10 soda glass m 2
11 soda glass l 0
12 water paper s 1
13 water paper m 0
14 water paper l 0
15 water glass s 0
16 water glass m 0
17 water glass l 0
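For completeness, the groupby idea above can be compressed with DataFrame.value_counts — a minimal sketch, assuming the original three-column df and pandas >= 1.1:
idx_cols = ['item', 'cup', 'size']
midx = pd.MultiIndex.from_product([df[c].unique() for c in idx_cols],
                                  names=idx_cols)
freq = (df.value_counts(idx_cols)        # count the (item, cup, size) rows present
          .reindex(midx, fill_value=0)   # add the missing combinations as 0
          .reset_index(name='freq'))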

Related

how to create a matrix when two values are in the same groupby column pandas?

So I basically have a dataframe of products and orders:
product order
apple 111
orange 111
apple 121
beans 121
rice 131
orange 131
apple 141
orange 141
What I need to do is group the products by the id of the order and generate this matrix with the number of times they appeared together in the same order.
I don't know any efficient way of doing this; any help would be appreciated!
apple orange beans rice
apple x 2 1 0
orange 2 x 0 1
beans 1 0 x 0
rice 0 1 0 x
One option is to join the dataframe with itself on order and then calculate the cooccurrences using crosstab on the two product columns:
df.merge(df, on='order').pipe(lambda df: pd.crosstab(df.product_x, df.product_y))
product_y apple beans orange rice
product_x
apple 3 1 2 0
beans 1 1 0 0
orange 2 0 3 1
rice 0 0 1 1
Another way is to perform a crosstab between product and order, then do a matrix multiplication (@) with the transpose:
a_ = pd.crosstab(df['product'], df['order'])
res = a_ @ a_.T
print(res)
product apple beans orange rice
product
apple 3 1 2 0
beans 1 1 0 0
orange 2 0 3 1
rice 0 0 1 1
or using pipe to make it a one-liner:
res = pd.crosstab(df['product'], df['order']).pipe(lambda x: x @ x.T)
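Both variants leave total occurrence counts on the diagonal, whereas the desired matrix shows 'x' there. A minimal sketch of one way to blank the self-pairings:
import numpy as np

co = pd.crosstab(df['product'], df['order']).pipe(lambda x: x @ x.T)
diag = np.eye(len(co), dtype=bool)           # boolean mask over the diagonal
co_disp = co.astype(object).mask(diag, 'x')  # mirror the 'x' placeholder
print(co_disp)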

How do I select suitable rows from different relevant columns? (pandas Dataframe)

Hi everyone, I'm a beginner with Pandas.
My aim: select the most valuable team from the "team_list".
The most valuable team means: most goals, fewest Yellow and Red Cards.
The "team_list" consists of four columns: "Team", "Goals", "Yellow Cards", "Red Cards" (the full data is shown below).
I want to solve the question like this, but it isn't Python style. How can I do that?
sortGoals=euro.sort_values(by=['Goals'],ascending=False);
sortCards=sortGoals.sort_values(by=['Yellow Cards','Red Cards']);
print (sortCards.head(1));
The result:
Team Goals Yellow Cards Red Cards
5 Germany 10 4 0
The team information:
euro = DataFrame({
    'Team': ['Croatia', 'Czech Republic', 'Denmark', 'England', 'France',
             'Germany', 'Greece', 'Italy', 'Netherlands', 'Poland',
             'Portugal', 'Republic of Ireland', 'Russia', 'Spain',
             'Sweden', 'Ukraine'],
    'Goals': [4, 4, 4, 5, 3, 10, 5, 6, 2, 2, 6, 1, 5, 12, 5, 2],
    'Yellow Cards': [9, 7, 4, 5, 6, 4, 9, 16, 5, 7, 12, 6, 6, 11, 7, 5],
    'Red Cards': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0]})
euro:
Team Goals Yellow Cards Red Cards
0 Croatia 4 9 0
1 Czech Republic 4 7 0
2 Denmark 4 4 0
3 England 5 5 0
4 France 3 6 0
5 Germany 10 4 0
6 Greece 5 9 1
7 Italy 6 16 0
8 Netherlands 2 5 0
9 Poland 2 7 1
10 Portugal 6 12 0
11 Republic of Ireland 1 6 1
12 Russia 5 6 0
13 Spain 12 11 0
14 Sweden 5 7 0
15 Ukraine 2 5 0
Joran Beasley inspired me, thank you.
euro['RedCard_rate'] = euro['Red Cards'] / euro['Goals']
euro['YellowCard_rate'] = euro['Yellow Cards'] / euro['Goals']
sort_teams = euro.sort_values(by=['YellowCard_rate', 'RedCard_rate'])
print(sort_teams[['Team', 'Goals', 'Yellow Cards', 'Red Cards']].head(1))
the results:
Team Goals Yellow Cards Red Cards
5 Germany 10 4 0
You can do this:
germany = euro.loc[euro.Team == 'Germany']
More on pandas here: https://pandas.pydata.org/docs/user_guide/index.html
Is this what you're looking for?
df[df['Team'].eq('Germany')]
Team Goals Yellow Cards Red Cards
5 Germany 10 4 0
import pandas

df = pandas.DataFrame({'Team': ['Croatia', 'Czech Republic', 'Denmark',
                                'England', 'France', 'Germany', 'Greece',
                                'Italy', 'Netherlands', 'Poland', 'Portugal',
                                'Republic of Ireland', 'Russia', 'Spain',
                                'Sweden', 'Ukraine'],
                       'Goals': [4, 4, 4, 5, 3, 10, 5, 6, 2, 2, 6, 1, 5, 12, 5, 2],
                       'Yellow Cards': [9, 7, 4, 5, 6, 4, 9, 16, 5, 7, 12, 6, 6, 11, 7, 5],
                       'Red Cards': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0]})
# score each team: goals minus total cards
scores = df['Goals'] - df['Yellow Cards'] - df['Red Cards']
df2 = pandas.DataFrame({'Team': df['Team'], 'score': scores})
print(df2['Team'][df2['score'].idxmax()])
Is that what you mean?
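As a side note, the asker's two-step sort can be collapsed into one sort_values call with multiple keys; since the sort is stable, this sketch is equivalent to sorting by goals first and then re-sorting by cards:
best = euro.sort_values(by=['Yellow Cards', 'Red Cards', 'Goals'],
                        ascending=[True, True, False]).head(1)
print(best)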

Recursively add text from one row to another

I want to add text from one row to another using conditional joining. Here is a sample dataset:
import pandas as pd
df = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'Meal': [1, 2, 3, 1, 2, 3, 4, 5],
                   'Solo': [1, 0, 1, 1, 0, 0, 0, 0],
                   'Dependency': [0, 1, 0, 0, 1, 2, 2, 1],
                   'Food': ['Steak', 'Eggs and meal 1', 'Lamb', 'Chicken',
                            'Steak and meal 1 with eggs', 'Soup and meal 2',
                            'Water and meal 2 with meal 1', 'Ham with meal 1']
                   })
Resulting DataFrame:
ID Meal Solo Dependency Food
0 A 1 1 0 Steak
1 A 2 0 1 Eggs and meal 1
2 A 3 1 0 Lamb
3 B 1 1 0 Chicken
4 B 2 0 1 Steak and meal 1 with eggs
5 B 3 0 2 Soup and meal 2
6 B 4 0 2 Water and meal 2 with meal 1
7 B 5 0 1 Ham with meal 1
I want to create a column with combined meal information:
ID Meal Combined
0 A 1 Steak
1 A 2 Eggs and Steak
2 A 3 Lamb
3 B 1 Chicken
4 B 2 Steak and Chicken with eggs
5 B 3 Soup and Steak and Chicken with eggs
6 B 4 Water and Steak and Chicken with eggs
7 B 5 Ham with Chicken
Any help would be much appreciated.
Thanks!
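No answer is recorded here; below is a minimal sketch under the assumption that every "meal N" token in Food should be replaced with the already-combined text of meal N for the same ID. Processing meals in ascending order makes the substitution effectively recursive. (Note it also resolves the second reference in row 6, which the sample output appears to drop.)
import re

def combine(group):
    combined = {}
    # resolve meals in order so earlier meals are already expanded
    for _, row in group.sort_values('Meal').iterrows():
        text = re.sub(r'meal (\d+)',
                      lambda m: combined.get(int(m.group(1)), m.group(0)),
                      row['Food'])
        combined[row['Meal']] = text
    return group.assign(Combined=group['Meal'].map(combined))

out = df.groupby('ID', group_keys=False).apply(combine)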

How do I sum values from one column dependent on items in other columns?

I have the following dataframe:
Course Orders Ingredient 1 Ingredient 2 Ingredient 3
starter 3 Fish Bread Mayonnaise
starter 1 Olives Bread
starter 5 Hummus Pita
main 1 Pizza
main 6 Beef Potato Peas
main 9 Fish Peas
main 11 Bread Mayonnaise Beef
main 4 Pasta Bolognese Peas
desert 10 Cheese Olives Crackers
desert 7 Cookies Cream
desert 8 Cheesecake Cream
I would like to sum the number of orders for each ingredient per course. It is not important which column the ingredient is in.
The following dataframe is what I would like my output to be:
Course Ord Ing1 IngOrd1 Ing2 IngOrd2 Ing3 IngOrd3
starter 3 Fish 3 Bread 4 Mayo 3
starter 1 Olives 1 Bread 4
starter 5 Hummus 5 Pita 5
main 1 Pizza 1
main 6 Beef 17 Potato 6 Peas 21
main 9 Fish 9 Peas 21
main 11 Bread 11 Mayo 11 Beef 17
main 4 Pasta 4 Bolognese 4 Peas 21
desert 10 Cheese 10 Olives 10 Crackers 10
desert 7 Cookies 7 Cream 15
desert 8 Cheesecake 8 Cream 15
I have tried using groupby().sum() but this does not work with the ingredients in 3 columns.
I also cannot use lookup because there are instances in the full dataframe where I do not know what ingredient I am looking for.
I don't believe there's a really slick way to do this with groupby or other such pandas methods, though I'm happy to be proven wrong. In any case, the following is not especially pretty, but it will give you what you're after.
import pandas as pd
from collections import defaultdict

# The data you provided
df = pd.read_csv('orders.csv')

# Group these labels for convenience
ingredients = ['Ingredient 1', 'Ingredient 2', 'Ingredient 3']
orders = ['IngOrd1', 'IngOrd2', 'IngOrd3']

# Interleave the two lists for the final data frame
combined = [y for x in zip(ingredients, orders) for y in x]

# Restructure the data frame so we can group on ingredients
melted = pd.melt(df, id_vars=['Course', 'Orders'], value_vars=ingredients,
                 value_name='Ingredient')

# This is a map that we can apply to each ingredient column to
# look up the correct order count
maps = defaultdict(lambda: defaultdict(int))

# Build the map. Every course/ingredient pair is keyed to the total
# count for that pair, e.g. {(main, beef): 17, ...}
for index, group in melted.groupby(['Course', 'Ingredient']):
    course, ingredient = index
    maps[course][ingredient] += group.Orders.sum()

# Now apply the map to each ingredient column of the data frame
# to create the new count columns
for i, o in zip(ingredients, orders):
    df[o] = df.apply(lambda x: maps[x.Course][x[i]], axis=1)

# Adjust the column labels
df = df[['Course', 'Orders'] + combined]
print(df)
Course Orders Ingredient 1 IngOrd1 Ingredient 2 IngOrd2 Ingredient 3 IngOrd3
0 starter 3 Fish 3 Bread 4 Mayonnaise 3
1 starter 1 Olives 1 Bread 4 NaN 0
2 starter 5 Hummus 5 Pita 5 NaN 0
3 main 1 Pizza 1 NaN 0 NaN 0
4 main 6 Beef 17 Potato 6 Peas 19
5 main 9 Fish 9 Peas 19 NaN 0
6 main 11 Bread 11 Mayonnaise 11 Beef 17
7 main 4 Pasta 4 Bolognese 4 Peas 19
8 desert 10 Cheese 10 Olives 10 Crackers 10
9 desert 7 Cookies 7 Cream 15 NaN 0
10 desert 8 Cheesecake 8 Cream 15 NaN 0
You'll need to handle the NaNs and 0 counts if that's an issue. But that's a trivial task.
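A shorter variant of the same mapping idea, sketched with a plain dict lookup instead of apply (it reuses melted, ingredients, and orders from the answer above):
# total orders per (course, ingredient) pair
counts = melted.groupby(['Course', 'Ingredient'])['Orders'].sum().to_dict()

# look each ingredient column up against the totals; missing entries become 0
for ing_col, ord_col in zip(ingredients, orders):
    df[ord_col] = [counts.get((course, ing), 0)
                   for course, ing in zip(df['Course'], df[ing_col])]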

How to do arithmetic on tidy data in pandas?

I have a DataFrame in "tidy" format (columns are variables, rows are observations) containing time series data for several different conditions. I'd like to normalize the data to the zero-hour time point for each condition.
For example, let's say I fed two different animals two different kinds of meal, then every hour I recorded how much food was left:
In [4]: df
Out[4]:
animal meal time food_left
0 lion meat 0 10
1 lion meat 1 5
2 lion meat 2 2
3 tiger meat 0 5
4 tiger meat 1 3
5 tiger meat 2 2
6 lion vegetable 0 5
7 lion vegetable 1 5
8 lion vegetable 2 5
9 tiger vegetable 0 5
10 tiger vegetable 1 5
11 tiger vegetable 2 5
For each time point, I want to calculate how much food a particular animal has eaten (food_eaten) by subtracting food_left at that time point from food_left at time point zero (for that animal and meal), then store the result in another column, e.g.:
animal meal time food_left food_eaten
0 lion meat 0 10 0
1 lion meat 1 5 5
2 lion meat 2 2 8
3 tiger meat 0 5 0
4 tiger meat 1 3 2
5 tiger meat 2 2 3
6 lion vegetable 0 5 0
7 lion vegetable 1 5 0
8 lion vegetable 2 5 0
9 tiger vegetable 0 5 0
10 tiger vegetable 1 5 0
11 tiger vegetable 2 5 0
I'm struggling to figure out how to apply this transformation in Pandas to produce the final data frame (preferably also in tidy format). Importantly, I need to keep the metadata (animal, meal, etc.).
Preferably I'd like a solution which generalizes to different groupings and transformations; for instance, what if I want to divide the amount the tiger ate at each time point by the amount the lion ate (for a given meal) at that time point, or find out how much less the lion ate of vegetables than meat, and so on.
Things I've tried:
groupby:
In [15]: df2 = df.set_index(['time'])
In [16]: df2.groupby(['animal','meal']).transform(lambda x: x[0] - x)
Out[16]:
food_left
time
0 0
1 5
2 8
0 0
1 2
2 3
0 0
1 0
2 0
0 0
1 0
2 0
Result is correct, but the metadata is lost, and I can't join it back to the original df
If I set_index on ['time', 'animal', 'meal'], then I can't groupby:
In [17]: df2 = df.set_index(['time','animal','meal'])
In [19]: df2.groupby(['animal','meal']).transform(lambda x: x[0] - x)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
... snip ...
KeyError: 'animal'
pivot:
In [21]: data_pivot = df.pivot_table(columns=['animal','meal'],index=['time'],values='food_left')
In [22]: data_norm = data_pivot.rsub(data_pivot.loc[0], axis=1)
In [23]: data_norm
Out[23]:
animal lion tiger
meal meat vegetable meat vegetable
time
0 0 0 0 0
1 5 0 2 0
2 8 0 3 0
This is a bit better and I could probably retrieve the original data with melt or unstack, but it seems inelegant. Is there a better way?
You can create a new column based on the transformed data; as a one-liner, it would be:
df['food_eaten'] = (df.set_index(['time'])
                      .groupby(['animal', 'meal'])
                      .transform(lambda x: x[0] - x)
                      .values)
df
animal meal time food_left food_eaten
0 lion meat 0 10 0
1 lion meat 1 5 5
2 lion meat 2 2 8
3 tiger meat 0 5 0
4 tiger meat 1 3 2
5 tiger meat 2 2 3
6 lion vegetable 0 5 0
7 lion vegetable 1 5 0
8 lion vegetable 2 5 0
9 tiger vegetable 0 5 0
10 tiger vegetable 1 5 0
11 tiger vegetable 2 5 0
You want to use groupby and diff:
df['food_eaten'] = -df.groupby(['animal', 'meal'])['food_left'].diff()
Follow that with fillna() if you want zeroes rather than NaN where nothing was eaten. While this doesn't directly generalize, you now have the amount of each type of food eaten by each animal in each time interval, so you can do additional computations on this new field.
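If you want the cumulative "eaten since time zero" figure the question asks for, a sketch with transform('first') keeps the frame tidy without touching the index (assuming rows are sorted by time within each group):
baseline = df.groupby(['animal', 'meal'])['food_left'].transform('first')
df['food_eaten'] = baseline - df['food_left']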
