How to maintain order when selecting rows in pandas dataframe? - python

I want to select rows in a particular order given in a list. For example, with this dataframe:
a=[['car',1],['bike',3],['jewel',2],['tv',5],['phone',6]]
df=pd.DataFrame(a,columns=['items','quantity'])
>>> df
items quantity
0 car 1
1 bike 3
2 jewel 2
3 tv 5
4 phone 6
I want to get the rows in this order: ['tv','car','phone'], that is, first the tv row, then car, then phone. I tried this method, but it doesn't maintain the order:
arr=['tv','car','phone']
df.loc[df['items'].isin(arr)]
items quantity
0 car 1
3 tv 5
4 phone 6

Here's a non-intrusive solution using Index.get_indexer that doesn't involve setting the index:
df.iloc[pd.Index(df['items']).get_indexer(['tv','car','phone'])]
items quantity
3 tv 5
0 car 1
4 phone 6
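One caveat not covered in the answer above: get_indexer returns -1 for labels that are not found, and iloc would then silently pick the last row. If that is a possibility, a minimal sketch for filtering out the misses (the 'xbox' label is just an illustration):
idx = pd.Index(df['items']).get_indexer(['tv', 'car', 'xbox'])  # 'xbox' is missing -> -1
df.iloc[idx[idx != -1]]  # keep only the positions that were actually matched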
Note that if this is going to become a frequent thing (by thing, I mean "indexing" with a list on a column), you're better off turning that column into an index. Bonus points if you sort it.
df2 = df.set_index('items')
df2.loc[['tv','car','phone']]
quantity
items
tv 5
car 1
phone 6
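A minimal sketch of the sorted-index variant mentioned above (lookups on a sorted index are generally faster on large frames, and .loc with a list still returns rows in the order of the list):
df2 = df.set_index('items').sort_index()
df2.loc[['tv', 'car', 'phone']]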

IIUC, you can use Categorical:
df=df.loc[df['items'].isin(arr)]
df.iloc[pd.Categorical(df['items'],categories=arr,ordered=True).argsort()]
Out[157]:
items quantity
3 tv 5
0 car 1
4 phone 6
Or reindex. Note the only difference is that this will not preserve the previous index; if the original index matters, you should use Categorical. (As mentioned by Andy L., if you have duplicates in items, reindex will fail.)
df.set_index('items').reindex(arr).reset_index()
Out[160]:
items quantity
0 tv 5
1 car 1
2 phone 6
Or loop over arr:
pd.concat([df[df['items']==x] for x in arr])
Out[171]:
items quantity
3 tv 5
0 car 1
4 phone 6

merge to the rescue:
(pd.DataFrame({'items':['tv','car','phone']})
.merge(df, on='items')
)
Output:
items quantity
0 tv 5
1 car 1
2 phone 6
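If the original row index matters (merge gives the result a fresh RangeIndex), one possible sketch is to carry the old index through the merge as a column:
(pd.DataFrame({'items': ['tv', 'car', 'phone']})
 .merge(df.reset_index(), on='items')   # the old index travels along as the 'index' column
 .set_index('index'))                   # restore it (the index will be named 'index')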

Assuming all the items to be chosen exist in the input df, here's one with searchsorted that should be good on performance:
In [43]: sidx = df['items'].argsort()
In [44]: df.iloc[sidx[df['items'].searchsorted(['tv','car','phone'],sorter=sidx)]]
Out[44]:
items quantity
3 tv 5
0 car 1
4 phone 6

I would create a dictionary from arr, map it onto items, then dropna and sort_values:
d = dict(zip(arr, range(len(arr))))
Out[684]: {'car': 1, 'phone': 2, 'tv': 0}
df.loc[df['items'].map(d).dropna().sort_values().index]
Out[693]:
items quantity
3 tv 5
0 car 1
4 phone 6

Here is another variety that uses .loc.
# Move items to the index, select, then reset.
df.set_index("items").loc[arr].reset_index()
Or another that doesn't change the index.
df.loc[df.reset_index().set_index("items").loc[arr]["index"]]

Why not:
>>> df.iloc[df.loc[df['items'].isin(arr), 'items'].apply(arr.index).sort_values().index]
items quantity
3 tv 5
0 car 1
4 phone 6
>>>

Why not search for index, filter and re-order:
df['new_order'] = df['items'].apply(lambda x: arr.index(x) if x in arr else -1)
df_new = df[df['new_order']>=0].sort_values('new_order')
items quantity new_order
3 tv 5 0
0 car 1 1
4 phone 6 2
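If the helper column should not remain in the result, a small follow-up (not part of the original answer):
df_new = df_new.drop(columns='new_order')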

Related

Python - How to filter dataframe to keep Order number with only one product?

I have a dataframe that contains several columns:
CustomerId, OrderNumber, PartNumber, Description, Revenue. Something like this:
CustomerId OrderNumber PartNumber Description Revenue
0 6512345 1 ABC1 KitKat 2
1 6512345 1 ABC2 Chips 3
2 6512345 1 ABC3 Coke 2
3 4213124 2 GFA1 Sprite 2
4 2132142 3 FGA1 Beer 3
5 3242342 4 DCA1 Mentos 4
6 3242342 4 DCA2 Fanta 2
7 8768655 5 ABC1 KitKat 2
What I need is to filter this dataframe to contain only orders where the number of items for that order is one. So in this case those would be the orders with OrderNumber 2, 3 and 5, since these are the ones with a single transaction.
Since I have only a few rows of data and I know that I need to exclude OrderNumbers 1 and 4, I can do this:
sample = pd.read_excel('../data/Sample.xlsx')
sample = sample[(sample.OrderNumber != 1) & (sample.OrderNumber != 4)]
sample
Which gives me the output like this:
CustomerId OrderNumber PartNumber Description Revenue
3 4213124 2 GFA1 Sprite 2
4 2132142 3 FGA1 Beer 3
7 8768655 5 ABC1 KitKat 2
Now I have over 100k rows, so I need a function that will, for each order, check how many items are related to that order and then exclude the orders that have more than one item.
How would I do that?
One quick way is to use groupby.filter:
df.groupby('OrderNumber').filter(lambda g: len(g) == 1)
CustomerId OrderNumber PartNumber Description Revenue
3 4213124 2 GFA1 Sprite 2
4 2132142 3 FGA1 Beer 3
7 8768655 5 ABC1 KitKat 2
Or filter by transform('size'):
df[df.CustomerId.groupby(df.OrderNumber).transform('size') == 1]
CustomerId OrderNumber PartNumber Description Revenue
3 4213124 2 GFA1 Sprite 2
4 2132142 3 FGA1 Beer 3
7 8768655 5 ABC1 KitKat 2
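For completeness, a minimal self-contained sketch that rebuilds the sample data from the question and applies the size-based filter (values copied from the table above):
import pandas as pd

df = pd.DataFrame({
    'CustomerId': [6512345, 6512345, 6512345, 4213124, 2132142, 3242342, 3242342, 8768655],
    'OrderNumber': [1, 1, 1, 2, 3, 4, 4, 5],
    'PartNumber': ['ABC1', 'ABC2', 'ABC3', 'GFA1', 'FGA1', 'DCA1', 'DCA2', 'ABC1'],
    'Description': ['KitKat', 'Chips', 'Coke', 'Sprite', 'Beer', 'Mentos', 'Fanta', 'KitKat'],
    'Revenue': [2, 3, 2, 2, 3, 4, 2, 2],
})

# keep only orders that consist of exactly one row
single_item_orders = df[df.groupby('OrderNumber')['OrderNumber'].transform('size') == 1]
print(single_item_orders)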

Trying to group by, then sort a dataframe based on multiple values [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? And also, is there a more elegant approach to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
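Regarding the second part of the question (an equivalent of SQL's row_number()), a minimal sketch using groupby().cumcount(), which is not part of this answer:
df['rn'] = df.groupby('id').cumcount()   # 0-based row number within each id, in row order
df[df['rn'] <= 1][['id', 'value']]       # the first two rows per id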
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
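For illustration, a sketch of that clean-up applied to the example above:
df.groupby('id')['value'].nlargest(2).reset_index(level=1, drop=True)
id
1    3
1    2
2    4
2    3
3    1
4    1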
Sometimes sorting the whole data ahead of time is very time consuming. We can groupby first and do topk for each group:
topk = 2  # number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here, sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The value passed to head is the same as the value we give to nlargest: the number of rows to keep per group.
reset_index is optional.
This works for duplicated values
If you have duplicated values among the top-n values and want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the Audit department we get the top 3 salaries as 110k, 100k and 100k.
If we want non-duplicated salaries per department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed them to be 24-150 times faster than those solutions.
Also, instead of slicing, you can pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])

Top 2 products counts per day Pandas

I have a dataframe like in the pic below.
First, I want the top 2 products; second, I need the top 2 product frequencies per day, so I need to group by day and select the top 2 products from the products column. I tried this code, but it gives an error.
df.groupby("days", as_index=False)(["products"] == "Follow Up").count()
You need to groupby over both days and products and then use size. Once you have done this, you will have all the counts you require in the df.
You will then need to sort by both the day and the default 0 column, which now contains your counts; this column was created by resetting the index after the initial groupby.
We follow the instructions in Pandas get topmost n records within each group to give your desired result.
A full example:
Setup:
df = pd.DataFrame({'day':[1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3],
'value':['a','a','b','b','b','c','a','a','b','b','b','c','a','a','b','b','b','c']})
df.head(6)
day value
0 1 a
1 1 a
2 1 b
3 1 b
4 1 b
5 1 c
df_counts = df.groupby(['day','value']).size().reset_index().sort_values(['day', 0], ascending = [True, False])
df_top_2 = df_counts.groupby('day').head(2)
df_top_2
day value 0
1 1 b 3
0 1 a 2
4 2 b 3
3 2 a 2
7 3 b 3
6 3 a 2
Of course, you should rename the 0 column to something more reasonable, but this is a minimal example.
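One way to avoid the unnamed 0 column altogether (a sketch on the same data, not from the original answer) is to name the size column when resetting the index:
df_counts = (df.groupby(['day', 'value']).size()
               .reset_index(name='count')
               .sort_values(['day', 'count'], ascending=[True, False]))
df_top_2 = df_counts.groupby('day').head(2)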

How to replace dataframe columns on given condition?

I have a dataframe with two columns, first_columns and second_columns. first_columns is an id and second_columns is a room number.
As you can see from the pic, a particular id (person) serves in different room numbers. Now I want to replace the second_columns values with 1 or 0 based on the following condition:
1) If a particular first_columns id (person) does not serve in rooms 9, 10 and 11, then replace all of that id's room numbers with 1; if they do, replace all the room numbers with 0.
In the picture above, first_columns id 3737 does not work in rooms 9, 10 and 11, so all rows for 3737 have the room number replaced by 1.
I think you need groupby with transform to compare by sets, then invert the mask with ~ and convert to integers:
df['new'] = ((~df.groupby('first_column')['second_column']
                .transform(lambda x: set(x) >= set([9, 10, 11])))
              .astype(int))
print (df)
first_column second_column new
0 3767 2 1
1 3767 4 1
2 3767 6 1
3 6282 2 0
4 6282 9 0
5 6282 10 0
6 6282 11 0
7 10622 0 1
8 13096 7 1
9 13096 10 1
10 13896 11 1

Pandas multi index Dataframe - Select and remove

I need some help with cleaning a dataframe that has a MultiIndex.
It looks something like this:
cost
location season
Thorp park autumn £12
spring £13
summer £22
Sea life centre summer £34
spring £43
Alton towers and so on.............
location and season are index columns. I want to go through the data and remove any locations that don't have "season" values of all three seasons. So "Sea life centre" should be removed.
Can anyone help me with this?
Also, another question: my dataframe was created from a groupby command and doesn't have a column name for the "cost" column. Is this normal? There are values in the column, just no header.
Option 1
groupby + count. You can use the result to index your dataframe.
df
col
a 1 0
2 1
b 1 3
2 4
3 5
c 2 7
3 8
v = df.groupby(level=0).transform('count').values
df = df[v == 3]
df
col
b 1 3
2 4
3 5
Option 2
groupby + filter. This is Paul H's idea, will remove if he wants to post.
df.groupby(level=0).filter(lambda g: g.count() == 3)
col
b 1 3
2 4
3 5
Option 1
Thinking outside the box...
df.drop(df.count(level=0).col[lambda x: x < 3].index)
col
b 1 3
2 4
3 5
Same thing with a little more robustness because I'm not depending on values in a column.
df.drop(df.index.to_series().count(level=0).loc[lambda x: x < 3].index)
col
b 1 3
2 4
3 5
Option 2
Robustify for the general case with an undetermined number of seasons.
This uses pandas 0.21's groupby.pipe method:
df.groupby(level=0).pipe(lambda g: g.filter(lambda d: len(d) == g.size().max()))
col
b 1 3
2 4
3 5
