Recursively add text from one row to another - python

I want to add text from one row to another using conditional joining. Here is a sample dataset:
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'Meal': [1, 2, 3, 1, 2, 3, 4, 5],
    'Solo': [1, 0, 1, 1, 0, 0, 0, 0],
    'Dependency': [0, 1, 0, 0, 1, 2, 2, 1],
    'Food': ['Steak', 'Eggs and meal 1', 'Lamb', 'Chicken',
             'Steak and meal 1 with eggs', 'Soup and meal 2',
             'Water and meal 2 with meal 1', 'Ham with meal 1'],
})
Resulting DataFrame:
ID Meal Solo Dependency Food
0 A 1 1 0 Steak
1 A 2 0 1 Eggs and meal 1
2 A 3 1 0 Lamb
3 B 1 1 0 Chicken
4 B 2 0 1 Steak and meal 1 with eggs
5 B 3 0 2 Soup and meal 2
6 B 4 0 2 Water and meal 2 with meal 1
7 B 5 0 1 Ham with meal 1
I want to create a column with combined meal information:
ID Meal Combined
0 A 1 Steak
1 A 2 Eggs and Steak
2 A 3 Lamb
3 B 1 Chicken
4 B 2 Steak and Chicken with eggs
5 B 3 Soup and Steak and Chicken with eggs
6 B 4 Water and Steak and Chicken with eggs
7 B 5 Ham with Chicken
Any help would be much appreciated.
Thanks!
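A minimal sketch of one way to do this: build a per-ID lookup from meal number to food text, then expand each "meal N" token recursively with a regex (assuming the references never form a cycle). Note that row 6 of the expected output appears to drop the duplicate meal 1 reference; this sketch does the plain substitution and leaves any deduplication aside.

import re

def combine(group):
    # Per-ID lookup: meal number -> raw food text
    lookup = dict(zip(group['Meal'], group['Food']))

    def expand(text):
        # Replace each "meal N" token with its (recursively expanded) food text
        return re.sub(r'meal (\d+)',
                      lambda m: expand(lookup[int(m.group(1))]),
                      text)

    return group['Food'].map(expand)

df['Combined'] = df.groupby('ID', group_keys=False).apply(combine)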

Related

Count value pairings from different columns in a DataFrame with Pandas

I have a df like this one:
df = pd.DataFrame([["coffee","soda","coffee","water","soda","soda"],["paper","glass","glass","paper","paper","glass"], list('smlssm')]).T
df.columns = ['item','cup','size']
df:
item cup size
0 coffee paper s
1 soda glass m
2 coffee glass l
3 water paper s
4 soda paper s
5 soda glass m
I want to transform this into a df that looks like this:
item cup size freq
0 coffee paper s 1
1 coffee paper m 0
2 coffee paper l 0
3 coffee glass s 0
4 coffee glass m 0
5 coffee glass l 1
6 soda paper s 1
7 soda paper m 0
8 soda paper l 0
9 soda glass s 0
10 soda glass m 2
11 soda glass l 0
. . . . .
. . . . .
. . . . .
So for every item I want a row for each possible combination of cup and size, plus an additional column with the frequency.
What is the proper way to do this using pandas?
Try:
df["freq"] = 1
x = df.pivot_table(
index="item",
columns=["cup", "size"],
values="freq",
aggfunc="sum",
fill_value=0,
)
full_cols = pd.MultiIndex.from_product(
[
x.columns.get_level_values(0).unique(),
x.columns.get_level_values(1).unique(),
],
names=x.columns.names,
)
x = x.reindex(full_cols, fill_value=0, axis=1)
print(x.stack([0, 1]).reset_index(name="freq"))
Prints:
item cup size freq
0 coffee glass l 1
1 coffee glass m 0
2 coffee glass s 0
3 coffee paper l 0
4 coffee paper m 0
5 coffee paper s 1
6 soda glass l 0
7 soda glass m 2
8 soda glass s 0
9 soda paper l 0
10 soda paper m 0
11 soda paper s 1
12 water glass l 0
13 water glass m 0
14 water glass s 0
15 water paper l 0
16 water paper m 0
17 water paper s 1
Dataframe used:
item cup size
0 coffee paper s
1 soda glass m
2 coffee glass l
3 water paper s
4 soda paper s
5 soda glass m
Let's try:
Add a freq column to the dataframe to indicate that individual rows are worth 1 each.
groupby and sum to get the current counts in the DataFrame.
Create a MultiIndex from the unique values in each column.
Use the new midx to reindex with fill_value=0, so that freq is filled with 0 for combinations introduced by the new index.
reset_index to convert the index back into columns.
# Columns to reindex
idx_cols = ['item', 'cup', 'size']

# Create MultiIndex with unique values
midx = pd.MultiIndex.from_product(
    [df[c].unique() for c in idx_cols],
    names=idx_cols,
)

df = (
    df.assign(freq=1)                 # Add freq column initialized to 1
    .groupby(idx_cols)['freq'].sum()  # Groupby and sum freq
    .reindex(midx, fill_value=0)      # Reindex, filling new combinations with 0
    .reset_index()                    # Convert the index back into columns
)
df:
item cup size freq
0 coffee paper s 1
1 coffee paper m 0
2 coffee paper l 0
3 coffee glass s 0
4 coffee glass m 0
5 coffee glass l 1
6 soda paper s 1
7 soda paper m 0
8 soda paper l 0
9 soda glass s 0
10 soda glass m 2
11 soda glass l 0
12 water paper s 1
13 water paper m 0
14 water paper l 0
15 water glass s 0
16 water glass m 0
17 water glass l 0
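As an aside: on newer pandas (1.1+), DataFrame.value_counts can replace the assign/groupby/sum step. A sketch under that assumption, starting from the original dataframe (before the reassignment above) and reusing idx_cols and midx:

out = (df.value_counts(idx_cols)        # count each (item, cup, size) combination present
         .reindex(midx, fill_value=0)   # add the missing combinations with freq 0
         .reset_index(name='freq'))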
By using merge():
import itertools as it

dfa = df.groupby(['item', 'cup', 'size']).size().reset_index(name='freq')
dfb = pd.DataFrame(
    list(it.product(
        df['item'].unique(), df['cup'].unique(), df['size'].unique())),
    columns=dfa.columns[:-1])
dfa.merge(dfb, how='outer').fillna(0) \
   .sort_values(by=dfb.columns.to_list(), ascending=[True, True, False]) \
   .reset_index(drop=True).astype(int, errors='ignore')
item cup size freq
0 coffee glass s 0
1 coffee glass m 0
2 coffee glass l 1
3 coffee paper s 1
4 coffee paper m 0
5 coffee paper l 0
6 soda glass s 0
7 soda glass m 2
8 soda glass l 0
9 soda paper s 1
10 soda paper m 0
11 soda paper l 0
12 water glass s 0
13 water glass m 0
14 water glass l 0
15 water paper s 1
16 water paper m 0
17 water paper l 0
You can use another approach: itertools.product to enumerate every (item, cup, size) combination and count its occurrences:
import itertools as it

# A copy of the original dataframe is needed for counting the frequencies,
# because df itself is overwritten inside the loop.
df_cop = df.copy()
for index, (item, cup, size) in enumerate(
        it.product(df['item'].unique(), df['cup'].unique(), df['size'].unique())):
    df.loc[index, :'size'] = [item, cup, size]
    df.loc[index, 'freq'] = df_cop.values.tolist().count([item, cup, size])
print(df)
Output:
item cup size freq
0 coffee paper s 1
1 coffee paper m 0
2 coffee paper l 0
3 coffee glass s 0
4 coffee glass m 0
5 coffee glass l 1
6 soda paper s 1
7 soda paper m 0
8 soda paper l 0
9 soda glass s 0
10 soda glass m 2
11 soda glass l 0
12 water paper s 1
13 water paper m 0
14 water paper l 0
15 water glass s 0
16 water glass m 0
17 water glass l 0

Modify DataFrame based on another DataFrame in Pandas

I have these two dataframes
df1
Product Quantity Price Description
0 bread 3 12 desc1
1 cookie 5 10 desc2
2 milk 7 15 desc3
3 sugar 4 7 desc4
4 chocolate 5 9 desc5
df2
Attribute Configuration
0 Product C
1 Quantity C
2 Price D
3 Description D
What I'm trying to do: whenever the letter D appears in the Configuration column of df2, the df1 column named in Attribute should be dropped. So df2 acts as a configuration that tells me how to build another dataframe.
The condition could be something like...
if df2.Configuration == 'D'
    drop the df1 column whose header equals df2.Attribute
That's roughly the idea, but I'm not sure how to write it. What can I do?
The result should look like this...
df3
Product Quantity
0 bread 3
1 cookie 5
2 milk 7
3 sugar 4
4 chocolate 5
Using drop with the columns named in Attribute:
df1.drop(columns=df2.loc[df2.Configuration == 'D', 'Attribute'].tolist())
Product Quantity
0 bread 3
1 cookie 5
2 milk 7
3 sugar 4
4 chocolate 5
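Equivalently, you can select the columns to keep instead of dropping the rest; a small sketch assuming the same df1/df2:

# Keep only the columns whose Configuration is not 'D'
keep = df2.loc[df2.Configuration != 'D', 'Attribute'].tolist()
df3 = df1[keep]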

How do I sum values from one column dependent on items in other columns?

I have the following dataframe:
Course Orders Ingredient 1 Ingredient 2 Ingredient 3
starter 3 Fish Bread Mayonnaise
starter 1 Olives Bread
starter 5 Hummus Pita
main 1 Pizza
main 6 Beef Potato Peas
main 9 Fish Peas
main 11 Bread Mayonnaise Beef
main 4 Pasta Bolognese Peas
desert 10 Cheese Olives Crackers
desert 7 Cookies Cream
desert 8 Cheesecake Cream
I would like to sum the number of orders for each ingredient per course. It is not important which column the ingredient is in.
The following dataframe is what I would like my output to be:
Course Ord Ing1 IngOrd1 Ing2 IngOrd2 Ing3 IngOrd3
starter 3 Fish 3 Bread 4 Mayo 3
starter 1 Olives 1 Bread 4
starter 5 Hummus 5 Pita 5
main 1 Pizza 1
main 6 Beef 17 Potato 6 Peas 19
main 9 Fish 9 Peas 19
main 11 Bread 11 Mayo 11 Beef 17
main 4 Pasta 4 Bolognese 4 Peas 19
desert 10 Cheese 10 Olives 10 Crackers 10
desert 7 Cookies 7 Cream 15
desert 8 Cheesecake 8 Cream 15
I have tried using groupby().sum() but this does not work with the ingredients in 3 columns.
I also cannot use lookup because there are instances in the full dataframe where I do not know what ingredient I am looking for.
I don't believe there's a really slick way to do this with groupby or other such pandas methods, though I'm happy to be proven wrong. In any case, the following is not especially pretty, but it will give you what you're after.
import pandas as pd
from collections import defaultdict

# The data you provided
df = pd.read_csv('orders.csv')

# Group these labels for convenience
ingredients = ['Ingredient 1', 'Ingredient 2', 'Ingredient 3']
orders = ['IngOrd1', 'IngOrd2', 'IngOrd3']

# Interleave the two lists for the final data frame
combined = [y for x in zip(ingredients, orders) for y in x]

# Restructure the data frame so we can group on ingredients
melted = pd.melt(df, id_vars=['Course', 'Orders'], value_vars=ingredients,
                 value_name='Ingredient')

# This is a map that we can apply to each ingredient column to
# look up the correct order count
maps = defaultdict(lambda: defaultdict(int))

# Build the map. Every course/ingredient pair is keyed to the total
# count for that pair, e.g. {(main, beef): 17, ...}
for index, group in melted.groupby(['Course', 'Ingredient']):
    course, ingredient = index
    maps[course][ingredient] += group.Orders.sum()

# Now apply the map to each ingredient column of the data frame
# to create the new count columns
for i, o in zip(ingredients, orders):
    df[o] = df.apply(lambda x: maps[x.Course][x[i]], axis=1)

# Reorder the columns for the final layout
df = df[['Course', 'Orders'] + combined]
print(df)
Course Orders Ingredient 1 IngOrd1 Ingredient 2 IngOrd2 Ingredient 3 IngOrd3
0 starter 3 Fish 3 Bread 4 Mayonnaise 3
1 starter 1 Olives 1 Bread 4 NaN 0
2 starter 5 Hummus 5 Pita 5 NaN 0
3 main 1 Pizza 1 NaN 0 NaN 0
4 main 6 Beef 17 Potato 6 Peas 19
5 main 9 Fish 9 Peas 19 NaN 0
6 main 11 Bread 11 Mayonnaise 11 Beef 17
7 main 4 Pasta 4 Bolognese 4 Peas 19
8 desert 10 Cheese 10 Olives 10 Crackers 10
9 desert 7 Cookies 7 Cream 15 NaN 0
10 desert 8 Cheesecake 8 Cream 15 NaN 0
You'll need to handle the NaNs and 0 counts if that's an issue. But that's a trivial task.
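For instance, one way to blank out the counts for empty ingredient slots (a sketch, assuming the df produced above):

import numpy as np

# Hide the 0 count wherever the corresponding ingredient slot is NaN
for i, o in zip(ingredients, orders):
    df[o] = df[o].mask(df[i].isna(), np.nan)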

Using Pandas Data Frame how to apply count to multi level grouped columns?

I have a data frame with multiple columns, and I want to use count after groupby such that it is applied to the combination of two or more columns. For example, let's say I have two columns:
user_id product_name
1 Apple
1 Banana
1 Apple
2 Carrot
2 Tomato
2 Carrot
2 Tomato
3 Milk
3 Cucumber
...
What I want to achieve is something like this:
user_id product_name Product_Count_per_User
1 Apple 2
1 Banana 1
2 Carrot 2
2 Tomato 2
3 Milk 1
3 Cucumber 1
I cannot get it. I tried this:
dcf6 = df3.groupby(['user_id','product_name'])['user_id', 'product_name'].count()
but it does not seem to get what I want, and it displays 4 columns instead of 3. How do I do it? Thanks.
You are counting two columns at the same time; you can just use groupby.size:
(df.groupby(['user_id', 'product_name']).size()
 .rename('Product_Count_per_User').reset_index())
Or take the size over only one column:
df.groupby(['user_id', 'product_name'])['user_id'].size()
Use GroupBy.size:
dcf6 = (df3.groupby(['user_id', 'product_name']).size()
        .reset_index(name='Product_Count_per_User'))
print(dcf6)
user_id product_name Product_Count_per_User
0 1 Apple 2
1 1 Banana 1
2 2 Carrot 2
3 2 Tomato 2
4 3 Cucumber 1
5 3 Milk 1
What is the difference between size and count in pandas?
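In short: size counts all entries, including NaN, while count excludes NaN. A quick illustration:

s = pd.Series([1, None, 3])
s.size       # 3 -- total number of entries
s.count()    # 2 -- non-NaN entries only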
Based on your own code, just do this (note the dict form of agg is no longer supported in newer pandas):
(df.groupby(['user_id', 'product_name'])['user_id']
 .count().rename('Product_Count_per_User').reset_index(level=1))
product_name Product_Count_per_User
user_id
1 Apple 2
1 Banana 1
2 Carrot 2
2 Tomato 2
3 Cucumber 1
3 Milk 1

How to do arithmetic on tidy data in pandas?

I have a DataFrame in "tidy" format (columns are variables, rows are observations) containing time series data for several different conditions. I'd like to normalize the data to the zero-hour time point for each condition.
For example, let's say I fed two different animals two different kinds of meal, then every hour I recorded how much food was left:
In [4]: df
Out[4]:
animal meal time food_left
0 lion meat 0 10
1 lion meat 1 5
2 lion meat 2 2
3 tiger meat 0 5
4 tiger meat 1 3
5 tiger meat 2 2
6 lion vegetable 0 5
7 lion vegetable 1 5
8 lion vegetable 2 5
9 tiger vegetable 0 5
10 tiger vegetable 1 5
11 tiger vegetable 2 5
For each time point, I want to calculate how much food a particular animal has eaten (food_eaten) by subtracting food_left at that time point from food_left at time point zero (for that animal and meal), then store the result in another column, e.g.:
animal meal time food_left food_eaten
0 lion meat 0 10 0
1 lion meat 1 5 5
2 lion meat 2 2 8
3 tiger meat 0 5 0
4 tiger meat 1 3 2
5 tiger meat 2 2 3
6 lion vegetable 0 5 0
7 lion vegetable 1 5 0
8 lion vegetable 2 5 0
9 tiger vegetable 0 5 0
10 tiger vegetable 1 5 0
11 tiger vegetable 2 5 0
I'm struggling to figure out how to apply this transformation in Pandas to produce the final data frame (preferably also in tidy format). Importantly, I need to keep the metadata (animal, meal, etc.).
Preferably I'd like a solution which generalizes to different groupings and transformations; for instance, what if I want to divide the amount the tiger ate at each time point by the amount the lion ate (for a given meal) at that time point, or find out how much less the lion ate of vegetables than meat, and so on.
Things I've tried:
groupby:
In [15]: df2 = df.set_index(['time'])
In [16]: df2.groupby(['animal','meal']).transform(lambda x: x[0] - x)
Out[16]:
food_left
time
0 0
1 5
2 8
0 0
1 2
2 3
0 0
1 0
2 0
0 0
1 0
2 0
Result is correct, but the metadata is lost, and I can't join it back to the original df
If I set_index on ['time', 'animal', 'meal'], then I can't groupby:
In [17]: df2 = df.set_index(['time','animal','meal'])
In [19]: df2.groupby(['animal','meal']).transform(lambda x: x[0] - x)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
... snip ...
KeyError: 'animal'
pivot:
In [21]: data_pivot = df.pivot_table(columns=['animal','meal'],index=['time'],values='food_left')
In [22]: data_norm = data_pivot.rsub(data_pivot.loc[0], axis=1)
In [23]: data_norm
Out[23]:
animal lion tiger
meal meat vegetable meat vegetable
time
0 0 0 0 0
1 5 0 2 0
2 8 0 3 0
This is a bit better and I could probably retrieve the original data with melt or unstack, but it seems inelegant. Is there a better way?
You can create a new column based on the transformed data; as a one-liner, it would be:
df['food_eaten'] = (df.set_index('time')
                    .groupby(['animal', 'meal'])
                    .transform(lambda x: x[0] - x)
                    .values)
df
animal meal time food_left food_eaten
0 lion meat 0 10 0
1 lion meat 1 5 5
2 lion meat 2 2 8
3 tiger meat 0 5 0
4 tiger meat 1 3 2
5 tiger meat 2 2 3
6 lion vegetable 0 5 0
7 lion vegetable 1 5 0
8 lion vegetable 2 5 0
9 tiger vegetable 0 5 0
10 tiger vegetable 1 5 0
11 tiger vegetable 2 5 0
You want to use groupby and diff:
df['food_eaten'] = -df.groupby(['animal', 'meal'])['food_left'].diff()
Follow that with fillna() if you want zeroes rather than NaN where nothing was eaten. Note that this gives the amount of each food eaten by each animal in each time interval, rather than the cumulative total since time zero; while that doesn't directly generalize, you can build the additional computations on this new field.
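If you do want the cumulative amount eaten since time zero, as in the example output, a groupby/transform sketch (assuming rows within each group are sorted by time, so 'first' is the zero-hour row):

df['food_eaten'] = (
    df.groupby(['animal', 'meal'])['food_left'].transform('first')
    - df['food_left']
)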
