Hi, I am working with a pandas.DataFrame like below:
Name Quality
Carrot 50
Potato 34
Raddish 43
Ginger 50
Tomato 43
Cabbage 12
I want to assign a rank to each row of the dataframe. I have successfully been able to sort the dataframe based on the field Quality, like below:
Name Quality
Carrot 50
Ginger 50
Raddish 43
Tomato 43
Potato 34
Cabbage 12
Now what I want to do is add a new column called Position that holds the rank of each row.
The point is, the same rank can be given to two different elements if their quality is the same.
Sample Output Dataframe:
Name Quality Position
Carrot 50 1
Ginger 50 1
Raddish 43 2
Tomato 43 2
Potato 34 3
Cabbage 12 4
Notice how two elements with the same quality share the same position, while the elements below them get the next position. Also, the dataframe has on average 10 million records.
How can I do this with a pandas.DataFrame?
I sort my DataFrame like below:
df_sort = dataframe.sort_values(by=attribute, ascending=order)
df_sort = df_sort.reset_index(drop=True)
You're going to want to use rank.
There are a few variations of rank. The one you want here is dense: ties share the same rank, the next distinct value gets the next rank, and you don't end up with fractional (averaged) ranks.
df['Position'] = df.Quality.rank(method='dense', ascending=False).astype(int)
df
Name Quality Position
0 Carrot 50 1
1 Ginger 50 1
2 Raddish 43 2
3 Tomato 43 2
4 Potato 34 3
5 Cabbage 12 4
For demonstration purposes, if you didn't use dense but rather min, your dataframe would look like this:
Name Quality Position
0 Carrot 50 1
1 Ginger 50 1
2 Raddish 43 3
3 Tomato 43 3
4 Potato 34 5
5 Cabbage 12 6
The key here is to use ascending=False.
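For reference, the min table above comes from the same call with method='min'; a minimal sketch on the same DataFrame:
df['Position'] = df.Quality.rank(method='min', ascending=False).astype(int)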
For a pre-sorted dataframe, you can use pandas.factorize:
df['Rank'] = pd.factorize(df['Quality'])[0] + 1
print(df)
Name Quality Rank
0 Carrot 50 1
1 Ginger 50 1
2 Raddish 43 2
3 Tomato 43 2
4 Potato 34 3
5 Cabbage 12 4
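Note that pd.factorize numbers values in order of first appearance, so this only matches a dense rank when the frame is already sorted by Quality in descending order. A small sketch that sorts first and then factorizes (column names taken from the question):
df_sort = df.sort_values('Quality', ascending=False).reset_index(drop=True)
df_sort['Rank'] = pd.factorize(df_sort['Quality'])[0] + 1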
I have a dataframe, that looks something like :
Item Type Location Count
1 Dog USA 10
2 Dog UK 20
3 Cat JAPAN 30
4 Cat UK 40
5 Bird CHINA 50
6 Bird SPAIN 60
7 Bird UAE 70
I would like to add a "Total" row with the sum of the "Count" column at the end of each unique "Type" group. Moreover, I would like to fill down only the "Type" column, as below:
Item Type Location Count
1 Dog USA 10
2 Dog UK 20
Total Dog 30
3 Cat JAPAN 30
4 Cat UK 40
Total Cat 70
5 Bird CHINA 50
6 Bird SPAIN 60
7 Bird UAE 70
Total Bird 180
What I have tried, which sums all the "Count" values instead:
df.loc["Count"] = df.sum()
First reset the index of the dataframe, then group it on Type and aggregate the Count column with sum and the index column with max, then assign an Item column whose value is Total. Finally, concat the frame with the original dataframe df and sort the index to maintain the order.
frame = (df.reset_index()
           .groupby('Type', as_index=False)
           .agg({'Count': 'sum', 'index': 'max'})
           .assign(Item='Total')
           .set_index('index'))
pd.concat([df, frame]).sort_index(ignore_index=True)
Another approach you might want to try (might be faster than the above one):
def summarize():
    for k, g in df.groupby('Type', sort=False):
        # DataFrame.append is gone in recent pandas, so build the
        # total row separately and concat it onto the group
        total = pd.DataFrame([{'Item': 'Total', 'Type': k,
                               'Location': '', 'Count': g['Count'].sum()}])
        yield pd.concat([g, total], ignore_index=True)

pd.concat(summarize(), ignore_index=True)
which results in:
Item Type Location Count
0 1 Dog USA 10
1 2 Dog UK 20
2 Total Dog 30
3 3 Cat JAPAN 30
4 4 Cat UK 40
5 Total Cat 70
6 5 Bird CHINA 50
7 6 Bird SPAIN 60
8 7 Bird UAE 70
9 Total Bird 180
df.groupby("Type").agg({"Count": "sum"})
I would like to convert my df1 to df2. They are mentioned below.
df1:
Item totSaleAmount category
0 apple 10 Fruit
1 orange 50 Fruit
2 apple 20 Fruit
3 carrot 60 Vegetable
4 potato 30 Vegetable
5 coffee 5 Beverage
6 potato 10 Vegetable
7 tea 5 Beverage
8 tea 5 Beverage
9 strawberry 40 Fruit
df2(o/p):
Item totSale count category
0 apple 30 2 fruit
1 orange 50 1 fruit
2 carrot 60 1 vegetable
3 potato 40 2 vegetable
4 coffee 5 1 beverage
5 tea 10 2 beverage
6 strawberry 40 1 fruit
I was able to perform a few actions separately: for example, I could get the counts using value_counts(), and I could get the sums using sum() after filtering df on the item. But I am not sure how to put everything together and convert df1 to df2.
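A sketch of one way to put this together in a single grouped aggregation (assuming df1 is the frame above; the totSale and count names mirror the desired df2, and lowercasing category is left as a separate step):
df2 = (df1.groupby(['Item', 'category'], as_index=False, sort=False)
          .agg(totSale=('totSaleAmount', 'sum'),
               count=('totSaleAmount', 'count')))
df2 = df2[['Item', 'totSale', 'count', 'category']]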
I have the following dataframe:
Course Orders Ingredient 1 Ingredient 2 Ingredient 3
starter 3 Fish Bread Mayonnaise
starter 1 Olives Bread
starter 5 Hummus Pita
main 1 Pizza
main 6 Beef Potato Peas
main 9 Fish Peas
main 11 Bread Mayonnaise Beef
main 4 Pasta Bolognese Peas
desert 10 Cheese Olives Crackers
desert 7 Cookies Cream
desert 8 Cheesecake Cream
I would like to sum the number of orders for each ingredient per course. It is not important which column the ingredient is in.
The following dataframe is what I would like my output to be:
Course Ord Ing1 IngOrd1 Ing2 IngOrd2 Ing3 IngOrd3
starter 3 Fish 3 Bread 4 Mayo 3
starter 1 Olives 1 Bread 4
starter 5 Hummus 5 Pita 5
main 1 Pizza 1
main 6 Beef 17 Potato 6 Peas 21
main 9 Fish 9 Peas 21
main 11 Bread 11 Mayo 11 Beef 17
main 4 Pasta 4 Bolognese 4 Peas 21
desert 10 Cheese 10 Olives 10 Crackers 10
desert 7 Cookies 7 Cream 15
desert 8 Cheesecake 8 Cream 15
I have tried using groupby().sum() but this does not work with the ingredients in 3 columns.
I also cannot use lookup because there are instances in the full dataframe where I do not know what ingredient I am looking for.
I don't believe there's a really slick way to do this with groupby or other such pandas methods, though I'm happy to be proven wrong. In any case, the following is not especially pretty, but it will give you what you're after.
import pandas as pd
from collections import defaultdict
# The data you provided
df = pd.read_csv('orders.csv')
# Group these labels for convenience
ingredients = ['Ingredient 1', 'Ingredient 2', 'Ingredient 3']
orders = ['IngOrd1', 'IngOrd2', 'IngOrd3']
# Interleave the two lists for final data frame
combined = [y for x in zip(ingredients, orders) for y in x]
# Restructure the data frame so we can group on ingredients
melted = pd.melt(df, id_vars=['Course', 'Orders'], value_vars=ingredients, value_name='Ingredient')
# This is a map that we can apply to each ingredient column to
# look up the correct order count
maps = defaultdict(lambda: defaultdict(int))
# Build the map. Every course/ingredient pair is keyed to the total
# count for that pair, e.g. {(main, beef): 17, ...}
for index, group in melted.groupby(['Course', 'Ingredient']):
    course, ingredient = index
    maps[course][ingredient] += group.Orders.sum()

# Now apply the map to each ingredient column of the data frame
# to create the new count columns
for i, o in zip(ingredients, orders):
    df[o] = df.apply(lambda x: maps[x.Course][x[i]], axis=1)

# Adjust the column labels
df = df[['Course', 'Orders'] + combined]

print(df)
Course Orders Ingredient 1 IngOrd1 Ingredient 2 IngOrd2 Ingredient 3 IngOrd3
0 starter 3 Fish 3 Bread 4 Mayonnaise 3
1 starter 1 Olives 1 Bread 4 NaN 0
2 starter 5 Hummus 5 Pita 5 NaN 0
3 main 1 Pizza 1 NaN 0 NaN 0
4 main 6 Beef 17 Potato 6 Peas 19
5 main 9 Fish 9 Peas 19 NaN 0
6 main 11 Bread 11 Mayonnaise 11 Beef 17
7 main 4 Pasta 4 Bolognese 4 Peas 19
8 desert 10 Cheese 10 Olives 10 Crackers 10
9 desert 7 Cookies 7 Cream 15 NaN 0
10 desert 8 Cheesecake 8 Cream 15 NaN 0
You'll need to handle the NaNs and 0 counts if that's an issue. But that's a trivial task.
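For example, a hypothetical cleanup that blanks out the missing ingredients and their zero counts (assuming empty strings are acceptable placeholders):
for i, o in zip(ingredients, orders):
    missing = df[i].isna()
    df.loc[missing, i] = ''   # clear the NaN ingredient
    df.loc[missing, o] = ''   # clear its meaningless 0 count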
Suppose I have a dataframe called df as follows:
A_column B_column C_column
0 Apple 100 15
1 Banana 80 20
2 Orange 110 10
3 Apple 150 16
4 Apple 90 13
[Q] How to list the index [0,3,4] for Apple in A_column?
You can just pass the row indexes as a list to df.iloc:
>>> df
A_column B_column C_column
0 Apple 100 15
1 Banana 80 20
2 Orange 110 10
3 Apple 150 16
4 Apple 90 13
>>> df.iloc[[0,3,4]]
A_column B_column C_column
0 Apple 100 15
3 Apple 150 16
4 Apple 90 13
EDIT: it seems I misunderstood your question.
So you want the list containing the index numbers of the rows containing 'Apple'; you can use df.index[df['A_column']=='Apple'].tolist()
>>> df.index[df['A_column']=='Apple'].tolist()
[0, 3, 4]
I have a DataFrame in "tidy" format (columns are variables, rows are observations) containing time series data for several different conditions. I'd like to normalize the data to the zero-hour time point for each condition.
For example, let's say I fed two different animals two different kinds of meal, then every hour I recorded how much food was left:
In [4]: df
Out[4]:
animal meal time food_left
0 lion meat 0 10
1 lion meat 1 5
2 lion meat 2 2
3 tiger meat 0 5
4 tiger meat 1 3
5 tiger meat 2 2
6 lion vegetable 0 5
7 lion vegetable 1 5
8 lion vegetable 2 5
9 tiger vegetable 0 5
10 tiger vegetable 1 5
11 tiger vegetable 2 5
For each time point, I want to calculate how much food a particular animal has eaten (food_eaten) by subtracting food_left at that time point from food_left at time point zero (for that animal and meal), then store the result in another column, e.g.:
animal meal time food_left food_eaten
0 lion meat 0 10 0
1 lion meat 1 5 5
2 lion meat 2 2 8
3 tiger meat 0 5 0
4 tiger meat 1 3 2
5 tiger meat 2 2 3
6 lion vegetable 0 5 0
7 lion vegetable 1 5 0
8 lion vegetable 2 5 0
9 tiger vegetable 0 5 0
10 tiger vegetable 1 5 0
11 tiger vegetable 2 5 0
I'm struggling to figure out how to apply this transformation in Pandas to produce the final data frame (preferably also in tidy format). Importantly, I need to keep the metadata (animal, meal, etc.).
Preferably I'd like a solution which generalizes to different groupings and transformations; for instance, what if I want to divide the amount the tiger ate at each time point by the amount the lion ate (for a given meal) at that time point, or find out how much less the lion ate of vegetables than meat, and so on.
Things I've tried:
groupby:
In [15]: df2 = df.set_index(['time'])
In [16]: df2.groupby(['animal','meal']).transform(lambda x: x[0] - x)
Out[16]:
food_left
time
0 0
1 5
2 8
0 0
1 2
2 3
0 0
1 0
2 0
0 0
1 0
2 0
Result is correct, but the metadata is lost, and I can't join it back to the original df
If I set_index on ['time', 'animal', 'meal'], then I can't groupby:
In [17]: df2 = df.set_index(['time','animal','meal'])
In [19]: df2.groupby(['animal','meal']).transform(lambda x: x[0] - x)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
... snip ...
KeyError: 'animal'
pivot:
In [21]: data_pivot = df.pivot_table(columns=['animal','meal'],index=['time'],values='food_left')
In [22]: data_norm = data_pivot.rsub(data_pivot.loc[0], axis=1)
In [23]: data_norm
Out[23]:
animal lion tiger
meal meat vegetable meat vegetable
time
0 0 0 0 0
1 5 0 2 0
2 8 0 3 0
This is a bit better and I could probably retrieve the original data with melt or unstack, but it seems inelegant. Is there a better way?
You can create a new column based on the transformed data; as a one-liner, it would be:
df['food_eaten'] = (df.set_index(['time'])
                      .groupby(['animal', 'meal'])
                      .transform(lambda x: x[0] - x)
                      .values)
df
animal meal time food_left food_eaten
0 lion meat 0 10 0
1 lion meat 1 5 5
2 lion meat 2 2 8
3 tiger meat 0 5 0
4 tiger meat 1 3 2
5 tiger meat 2 2 3
6 lion vegetable 0 5 0
7 lion vegetable 1 5 0
8 lion vegetable 2 5 0
9 tiger vegetable 0 5 0
10 tiger vegetable 1 5 0
11 tiger vegetable 2 5 0
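An equivalent sketch that keeps the alignment with the original rows without the set_index/.values round trip (assuming each group's first row is the time-0 observation):
df['food_eaten'] = (df.groupby(['animal', 'meal'])['food_left']
                      .transform(lambda s: s.iloc[0] - s))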
You want to use groupby and diff:
df['food_eaten'] = -df.groupby(['animal', 'meal'])['food_left'].diff()
Follow that with fillna(0) if you want zeroes rather than NaN where nothing has been eaten yet. While this doesn't directly generalize, you now have the amount of food of each type eaten by each animal in each time interval, so you can do the additional computations on this new field; for example, a running total recovers the amount eaten since time zero, as sketched below.
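A sketch of that running total, assuming the rows are ordered by time within each group:
eaten_per_interval = -df.groupby(['animal', 'meal'])['food_left'].diff().fillna(0)
df['food_eaten'] = eaten_per_interval.groupby([df['animal'], df['meal']]).cumsum()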