How to do arithmetic on tidy data in pandas?

I have a DataFrame in "tidy" format (columns are variables, rows are observations) containing time series data for several different conditions. I'd like to normalize the data to the zero-hour time point for each condition.
For example, let's say I fed two different animals two different kinds of meal, then every hour I recorded how much food was left:
In [4]: df
Out[4]:
    animal       meal  time  food_left
0     lion       meat     0         10
1     lion       meat     1          5
2     lion       meat     2          2
3    tiger       meat     0          5
4    tiger       meat     1          3
5    tiger       meat     2          2
6     lion  vegetable     0          5
7     lion  vegetable     1          5
8     lion  vegetable     2          5
9    tiger  vegetable     0          5
10   tiger  vegetable     1          5
11   tiger  vegetable     2          5
For each time point, I want to calculate how much food a particular animal has eaten (food_eaten) by subtracting food_left at that time point from food_left at time point zero (for that animal and meal), then store the result in another column, e.g.:
    animal       meal  time  food_left  food_eaten
0     lion       meat     0         10           0
1     lion       meat     1          5           5
2     lion       meat     2          2           8
3    tiger       meat     0          5           0
4    tiger       meat     1          3           2
5    tiger       meat     2          2           3
6     lion  vegetable     0          5           0
7     lion  vegetable     1          5           0
8     lion  vegetable     2          5           0
9    tiger  vegetable     0          5           0
10   tiger  vegetable     1          5           0
11   tiger  vegetable     2          5           0
I'm struggling to figure out how to apply this transformation in pandas to produce the final DataFrame (preferably also in tidy format). Importantly, I need to keep the metadata (animal, meal, etc.).
Preferably I'd like a solution which generalizes to different groupings and transformations; for instance, what if I want to divide the amount the tiger ate at each time point by the amount the lion ate (for a given meal) at that time point, or find out how much less the lion ate of vegetables than meat, and so on.
Things I've tried:
groupby:
In [15]: df2 = df.set_index(['time'])
In [16]: df2.groupby(['animal','meal']).transform(lambda x: x[0] - x)
Out[16]:
      food_left
time
0             0
1             5
2             8
0             0
1             2
2             3
0             0
1             0
2             0
0             0
1             0
2             0
The result is correct, but the metadata is lost, and I can't join it back to the original df.
If I set_index on ['time', 'animal', 'meal'], then I can't groupby:
In [17]: df2 = df.set_index(['time','animal','meal'])
In [19]: df2.groupby(['animal','meal']).transform(lambda x: x[0] - x)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
... snip ...
KeyError: 'animal'
pivot:
In [21]: data_pivot = df.pivot_table(columns=['animal','meal'],index=['time'],values='food_left')
In [22]: data_norm = data_pivot.rsub(data_pivot.loc[0], axis=1)
In [23]: data_norm
Out[23]:
animal lion           tiger
meal   meat vegetable  meat vegetable
time
0         0         0     0         0
1         5         0     2         0
2         8         0     3         0
This is a bit better, and I could probably retrieve the original data with melt or unstack, but it seems inelegant. Is there a better way?

You can create a new column based on the transformed data; as a one-liner it would be:
df['food_eaten'] = (df.set_index(['time'])
                      .groupby(['animal', 'meal'])
                      .transform(lambda x: x[0] - x)
                      .values)
df
    animal       meal  time  food_left  food_eaten
0     lion       meat     0         10           0
1     lion       meat     1          5           5
2     lion       meat     2          2           8
3    tiger       meat     0          5           0
4    tiger       meat     1          3           2
5    tiger       meat     2          2           3
6     lion  vegetable     0          5           0
7     lion  vegetable     1          5           0
8     lion  vegetable     2          5           0
9    tiger  vegetable     0          5           0
10   tiger  vegetable     1          5           0
11   tiger  vegetable     2          5           0
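A variant of the same idea (a sketch, assuming rows within each group are sorted by time so that .iloc[0] is the zero-hour value) transforms the column directly, skipping the set_index round-trip and the positional .values assignment:
df['food_eaten'] = (df.groupby(['animal', 'meal'])['food_left']
                      .transform(lambda s: s.iloc[0] - s))
Because transform returns a result aligned to the original index, the metadata columns stay in place, and the same pattern covers other per-group normalizations, e.g. lambda s: s / s.iloc[0].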

You want to use groupby and diff:
df['food_eaten'] = -df.groupby(['animal', 'meal'])['food_left'].diff()
Follow that with fillna(0) if you want zeroes rather than NaN at each group's first time point. While this doesn't directly generalize, you now have the amount of food of each type eaten by each animal in each time interval, so you can do further computations on this new field.
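Note that diff() gives the amount eaten in each interval rather than the running total from time zero that the question asked for; a minimal sketch (interval_eaten is just an illustrative column name) chains cumsum() within each group to recover the food_eaten column:
# per-interval amounts; the first row of each group is NaN, so fill with 0
df['interval_eaten'] = -df.groupby(['animal', 'meal'])['food_left'].diff().fillna(0)
# running total since time zero, matching the question's food_eaten column
df['food_eaten'] = df.groupby(['animal', 'meal'])['interval_eaten'].cumsum()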

Related

How to create a matrix when two values are in the same groupby column in pandas?

So I basically have a DataFrame of products and orders:
product  order
apple      111
orange     111
apple      121
beans      121
rice       131
orange     131
apple      141
orange     141
What I need to do is group the products by the order id and generate a matrix counting how many times each pair of products appeared together in the same order.
I don't know any efficient way of doing this, so any help is appreciated!
        apple  orange  beans  rice
apple       x       2      1     0
orange      2       x      0     1
beans       1       0      x     0
rice        0       1      0     x
One option is to join the DataFrame with itself on order and then calculate the co-occurrences using crosstab on the two product columns:
df.merge(df, on='order').pipe(lambda df: pd.crosstab(df.product_x, df.product_y))
product_y  apple  beans  orange  rice
product_x
apple          3      1       2     0
beans          1      1       0     0
orange         2      0       3     1
rice           0      0       1     1
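Note that the diagonal of this result counts how many orders contain each product, whereas the desired matrix had x on the diagonal; to zero it out instead (a small sketch along the same lines), drop the self-pairs before the crosstab:
(df.merge(df, on='order')
   .query('product_x != product_y')
   .pipe(lambda d: pd.crosstab(d.product_x, d.product_y)))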
Another way is to perform a crosstab between product and order, then do a matrix multiplication (the @ operator) with the transpose:
a_ = pd.crosstab(df['product'], df['order'])
res = a_ @ a_.T
print(res)
product  apple  beans  orange  rice
product
apple        3      1       2     0
beans        1      1       0     0
orange       2      0       3     1
rice         0      0       1     1
or, using pipe, as a one-liner:
res = pd.crosstab(df['product'], df['order']).pipe(lambda x: x @ x.T)

Recursively add text from one row to another

I want to add text from one row to another using conditional joining. Here is a sample dataset:
import pandas as pd
df = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'Meal': [1, 2, 3, 1, 2, 3, 4, 5],
                   'Solo': [1, 0, 1, 1, 0, 0, 0, 0],
                   'Dependency': [0, 1, 0, 0, 1, 2, 2, 1],
                   'Food': ['Steak', 'Eggs and meal 1', 'Lamb', 'Chicken',
                            'Steak and meal 1 with eggs', 'Soup and meal 2',
                            'Water and meal 2 with meal 1', 'Ham with meal 1']})
Resulting DataFrame:
  ID  Meal  Solo  Dependency                          Food
0  A     1     1           0                         Steak
1  A     2     0           1               Eggs and meal 1
2  A     3     1           0                          Lamb
3  B     1     1           0                       Chicken
4  B     2     0           1    Steak and meal 1 with eggs
5  B     3     0           2               Soup and meal 2
6  B     4     0           2  Water and meal 2 with meal 1
7  B     5     0           1               Ham with meal 1
I want to create a column with combined meal information:
  ID  Meal                               Combined
0  A     1                                  Steak
1  A     2                         Eggs and Steak
2  A     3                                   Lamb
3  B     1                                Chicken
4  B     2            Steak and Chicken with eggs
5  B     3   Soup and Steak and Chicken with eggs
6  B     4  Water and Steak and Chicken with eggs
7  B     5                       Ham with Chicken
Any help would be much appreciated.
Thanks!
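One hedged sketch of an approach (not from the original thread, and assuming each meal only references meals that appear earlier for the same ID): walk each ID group in order and substitute every "meal N" reference with that meal's already-expanded text.
import re

def combine_group(g):
    expanded = {}  # meal number -> fully expanded text
    out = []
    for _, row in g.iterrows():
        # replace each "meal N" reference with meal N's expanded text
        text = re.sub(r'meal (\d+)',
                      lambda m: expanded.get(int(m.group(1)), m.group(0)),
                      row['Food'])
        expanded[row['Meal']] = text
        out.append(text)
    return pd.Series(out, index=g.index)

df['Combined'] = df.groupby('ID', group_keys=False).apply(combine_group)
Note this expands every reference it finds, so meal 4 for ID B ('Water and meal 2 with meal 1') comes out with both references expanded, slightly longer than the sample output above.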

How do I sum values from one column dependent on items in other columns?

I have the following dataframe:
Course   Orders  Ingredient 1  Ingredient 2  Ingredient 3
starter       3  Fish          Bread         Mayonnaise
starter       1  Olives        Bread
starter       5  Hummus        Pita
main          1  Pizza
main          6  Beef          Potato        Peas
main          9  Fish          Peas
main         11  Bread         Mayonnaise    Beef
main          4  Pasta         Bolognese     Peas
desert       10  Cheese        Olives        Crackers
desert        7  Cookies       Cream
desert        8  Cheesecake    Cream
I would like to sum the number of orders for each ingredient per course. It is not important which column the ingredient is in.
The following dataframe is what I would like my output to be:
Course   Ord  Ing1        IngOrd1  Ing2       IngOrd2  Ing3      IngOrd3
starter    3  Fish              3  Bread            4  Mayo            3
starter    1  Olives            1  Bread            4
starter    5  Hummus            5  Pita             5
main       1  Pizza             1
main       6  Beef             17  Potato           6  Peas           19
main       9  Fish              9  Peas            19
main      11  Bread            11  Mayo            11  Beef           17
main       4  Pasta             4  Bolognese        4  Peas           19
desert    10  Cheese           10  Olives          10  Crackers       10
desert     7  Cookies           7  Cream           15
desert     8  Cheesecake        8  Cream           15
I have tried using groupby().sum() but this does not work with the ingredients in 3 columns.
I also cannot use lookup because there are instances in the full dataframe where I do not know what ingredient I am looking for.
I don't believe there's a really slick way to do this with groupby or other such pandas methods, though I'm happy to be proven wrong. In any case, the following is not especially pretty, but it will give you what you're after.
import pandas as pd
from collections import defaultdict

# The data you provided
df = pd.read_csv('orders.csv')

# Group these labels for convenience
ingredients = ['Ingredient 1', 'Ingredient 2', 'Ingredient 3']
orders = ['IngOrd1', 'IngOrd2', 'IngOrd3']

# Interleave the two lists for the final data frame
combined = [y for x in zip(ingredients, orders) for y in x]

# Restructure the data frame so we can group on ingredients
melted = pd.melt(df, id_vars=['Course', 'Orders'], value_vars=ingredients,
                 value_name='Ingredient')

# This is a map that we can apply to each ingredient column to
# look up the correct order count
maps = defaultdict(lambda: defaultdict(int))

# Build the map. Every course/ingredient pair is keyed to the total
# count for that pair, e.g. {('main', 'Beef'): 17, ...}
for index, group in melted.groupby(['Course', 'Ingredient']):
    course, ingredient = index
    maps[course][ingredient] += group.Orders.sum()

# Now apply the map to each ingredient column of the data frame
# to create the new count columns
for i, o in zip(ingredients, orders):
    df[o] = df.apply(lambda x: maps[x.Course][x[i]], axis=1)

# Adjust the column order
df = df[['Course', 'Orders'] + combined]
print(df)
     Course  Orders Ingredient 1  IngOrd1 Ingredient 2  IngOrd2 Ingredient 3  IngOrd3
0   starter       3         Fish        3        Bread        4   Mayonnaise        3
1   starter       1       Olives        1        Bread        4          NaN        0
2   starter       5       Hummus        5         Pita        5          NaN        0
3      main       1        Pizza        1          NaN        0          NaN        0
4      main       6         Beef       17       Potato        6         Peas       19
5      main       9         Fish        9         Peas       19          NaN        0
6      main      11        Bread       11   Mayonnaise       11         Beef       17
7      main       4        Pasta        4    Bolognese        4         Peas       19
8    desert      10       Cheese       10       Olives       10     Crackers       10
9    desert       7      Cookies        7        Cream       15          NaN        0
10   desert       8   Cheesecake        8        Cream       15          NaN        0
You'll need to handle the NaNs and 0 counts if that's an issue. But that's a trivial task.
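For what it's worth, a somewhat slicker sketch of the same idea (reusing the ingredients and orders lists defined above): melt once, sum Orders per (Course, Ingredient) pair, then look the totals up through a plain dict instead of nested defaultdicts:
melted = df.melt(id_vars=['Course', 'Orders'], value_vars=ingredients,
                 value_name='Ingredient')
# {(course, ingredient): total orders across all rows}
totals = melted.groupby(['Course', 'Ingredient'])['Orders'].sum().to_dict()
for i, o in zip(ingredients, orders):
    # rows with a missing ingredient (NaN) fall back to 0
    df[o] = [totals.get((c, ing), 0) for c, ing in zip(df['Course'], df[i])]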

How to get the cartesian product of a pandas dataframe under certain condition

Given a dataframe:
   qid cid title
0    1   a  croc
1    2   b   dog
2    3   a  fish
3    4   b   cat
4    5   a  bird
I want to get a new dataframe that is the cartesian product of each row with each other row which has the same cid value as it (that is, to get all the pairs of rows with the same cid):
  cid1 cid2  qid1 title1  qid2 title2
0    a    a     1   croc     3   fish
1    a    a     1   croc     5   bird
2    a    a     3   fish     5   bird
3    b    b     2    dog     4    cat
Suppose my dataset is about 500M; can anybody solve this problem in a comparatively efficient way?
One way to do it is to use a self-merge, then filter out all the unwanted records.
df.merge(df, on='cid', suffixes=('1','2')).query('qid1 < qid2')
Output:
    qid1 cid title1  qid2 title2
1      1   a   croc     3   fish
2      1   a   croc     5   bird
5      3   a   fish     5   bird
10     2   b    dog     4    cat
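If the full self-merge is too big to materialize at that scale, a hedged alternative (an assumption, not from the thread) is to build the pairs group by group with itertools.combinations, which never generates the self-pairs and reversed duplicates that query then throws away:
from itertools import combinations

import pandas as pd

pairs = pd.DataFrame(
    [(cid, q1, t1, q2, t2)
     for cid, g in df.groupby('cid')
     for (q1, t1), (q2, t2) in combinations(zip(g['qid'], g['title']), 2)],
    columns=['cid', 'qid1', 'title1', 'qid2', 'title2'])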

Using Pandas Data Frame how to apply count to multi level grouped columns?

I have a data frame with multiple columns, and I want to apply count after a groupby on the combination of two or more columns. For example, let's say I have two columns:
user_id  product_name
1        Apple
1        Banana
1        Apple
2        Carrot
2        Tomato
2        Carrot
2        Tomato
3        Milk
3        Cucumber
...
What I want to achieve is something like this:
user_id  product_name  Product_Count_per_User
1        Apple         2
1        Banana        1
2        Carrot        2
2        Tomato        2
3        Milk          1
3        Cucumber      1
I cannot get it. I tried this:
dcf6 = df3.groupby(['user_id','product_name'])['user_id', 'product_name'].count()
but it does not seem to do what I want, and it displays four columns instead of three. How do I do it? Thanks.
You are counting two columns at the same time; you can just use groupby.size:
(df.groupby(['user_id', 'product_name']).size()
   .rename('Product_Count_per_User').reset_index())
Or take the size of only one column:
df.groupby(['user_id', 'product_name'])['user_id'].size()
Use GroupBy.size:
dcf6 = (df3.groupby(['user_id', 'product_name']).size()
           .reset_index(name='Product_Count_per_User'))
print(dcf6)
   user_id product_name  Product_Count_per_User
0        1        Apple                       2
1        1       Banana                       1
2        2       Carrot                       2
3        2       Tomato                       2
4        3     Cucumber                       1
5        3         Milk                       1
What is the difference between size and count in pandas?
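In short (an illustrative snippet, not from the linked answer): size counts every row, including NaN, while count excludes NaN:
import pandas as pd

s = pd.Series([1, None, 3])
s.size       # 3: size counts every row, including the NaN
s.count()    # 2: count excludes NaN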
Based on your own code, just do this (using named aggregation, since the dict-renaming form of agg was removed in pandas 1.0):
(df.groupby(['user_id', 'product_name'])['user_id']
   .agg(Product_Count_per_User='count')
   .reset_index(level=1))
        product_name  Product_Count_per_User
user_id
1              Apple                       2
1             Banana                       1
2             Carrot                       2
2             Tomato                       2
3           Cucumber                       1
3               Milk                       1
