get first and last values in a groupby - python

I have a dataframe df
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, -1),
                  [['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
                   ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']],
                  ['X', 'Y'])
How do I get the first and last rows, grouped by the first level of the index?
I tried
df.groupby(level=0).agg(['first', 'last']).stack()
and got
X Y
a first 0 1
last 6 7
b first 8 9
last 12 13
c first 14 15
last 16 17
d first 18 19
last 18 19
This is so close to what I want. How can I preserve the level 1 index and get this instead:
X Y
a a 0 1
d 6 7
b e 8 9
g 12 13
c h 14 15
i 16 17
d j 18 19
j 18 19

Option 1
def first_last(df):
    # .ix was removed from pandas; iloc[[0, -1]] is the positional equivalent
    return df.iloc[[0, -1]]

df.groupby(level=0, group_keys=False).apply(first_last)
Option 2 - only works if the index is unique (with duplicate labels, df.loc would return every matching row, not just the first and last)
idx = df.index.to_series().groupby(level=0).agg(['first', 'last']).stack()
df.loc[idx]
Option 3 - per notes below, this only makes sense when there are no NAs
I also abused the agg function. The code below works, but is far uglier.
df.reset_index(1).groupby(level=0).agg(['first', 'last']).stack() \
.set_index('level_1', append=True).reset_index(1, drop=True) \
.rename_axis([None, None])
Note
per @unutbu: agg(['first', 'last']) takes the first and last non-NA values.
I interpreted this as: it must then be necessary to run this column by column. Further, forcing index level=1 to align may not even make sense.
Let's include another test
df = pd.DataFrame(np.arange(20).reshape(10, -1),
                  [list('aaaabbbccd'),
                   list('abcdefghij')],
                  list('XY'))
df.loc[tuple('aa'), 'X'] = np.nan

def first_last(df):
    return df.iloc[[0, -1]]

df.groupby(level=0, group_keys=False).apply(first_last)
df.reset_index(1).groupby(level=0).agg(['first', 'last']).stack() \
.set_index('level_1', append=True).reset_index(1, drop=True) \
.rename_axis([None, None])
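For reference (computed from the setup above; X becomes float once the NaN is assigned), the apply-based first_last keeps the NaN in place:
a a NaN 1
  d 6.0 7
while the agg-based version pulls the first valid X (2.0, which really belongs to ('a', 'b')) into the slot labeled ('a', 'a'):
a a 2.0 1
  d 6.0 7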
Sure enough! The second solution is taking the first valid value in column X. It is now nonsensical to have forced that value to align with the index ('a', 'a').

This could be one of the easiest solutions:
df.groupby(level=0, as_index=False).nth([0, -1])
X Y
a a 0 1
d 6 7
b e 8 9
g 12 13
c h 14 15
i 16 17
d j 18 19
Hope this helps.

Please try this:
For the last value: df.groupby('Column_name').nth(-1)
For the first value: df.groupby('Column_name').nth(0)
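A minimal sketch of what that looks like, on a made-up frame (the column name 'Column_name' is just a placeholder; note that in pandas >= 2.0, nth returns rows with their original index rather than the group keys):
import pandas as pd

df = pd.DataFrame({'Column_name': ['A', 'A', 'B', 'B'],
                   'value': [1, 2, 3, 4]})

df.groupby('Column_name').nth(0)   # first row of each group
df.groupby('Column_name').nth(-1)  # last row of each group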

Related

Combine two pandas index slices

How can two pandas.IndexSlices be combined into one?
Setup of the problem:
import pandas as pd
import numpy as np
idx = pd.IndexSlice
cols = pd.MultiIndex.from_product([['A', 'B', 'C'], ['x', 'y'], ['a', 'b']])
df = pd.DataFrame(np.arange(len(cols)*2).reshape((2, len(cols))), columns=cols)
df:
A B C
x y x y x y
a b a b a b a b a b a b
0 0 1 2 3 4 5 6 7 8 9 10 11
1 12 13 14 15 16 17 18 19 20 21 22 23
How can the two slices idx['A', 'y', :] and idx[['B', 'C'], 'x', :], be combined to show in one dataframe?
Separately they are:
df.loc[:, idx['A', 'y', :]]
A
y
a b
0 2 3
1 14 15
df.loc[:, idx[['B', 'C'], 'x', :]]
B C
x x
a b a b
0 4 5 8 9
1 16 17 20 21
Simply combining them as a list does not play nicely:
df.loc[:, [idx['A', 'y', :], idx[['B', 'C'], 'x', :]]]
....
TypeError: unhashable type: 'slice'
My current solution is incredibly clunky, but gives the sub df that I'm looking for:
df.loc[:, df.loc[:, idx['A', 'y', :]].columns.to_list()
       + df.loc[:, idx[['B', 'C'], 'x', :]].columns.to_list()]
A B C
y x x
a b a b a b
0 2 3 4 5 8 9
1 14 15 16 17 20 21
However this doesn't work when one of the slices is just a series (as expected), which is less fun:
df.loc[:, df.loc[:, idx['A', 'y', 'a']].columns.to_list()
       + df.loc[:, idx[['B', 'C'], 'x', :]].columns.to_list()]
...
AttributeError: 'Series' object has no attribute 'columns'
Are there any better alternatives to what I'm currently doing that would ideally work with dataframe slices and series slices?
A general solution is to concatenate both slices:
a = df.loc[:, idx['A', 'y', 'a']]
b = df.loc[:, idx[['B', 'C'], 'x', :]]
df = pd.concat([a, b], axis=1)
print(df)
A B C
y x x
a a b a b
0 2 4 5 8 9
1 14 16 17 20 21
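As an alternative sketch (my suggestion, not from the original answer), applied to the original wide df from the setup before it is reassigned above: MultiIndex.get_locs converts each slice into integer positions, which can be combined and fed to iloc. This also sidesteps the Series problem, because iloc with a list of positions always returns a DataFrame:
import numpy as np

locs_a = df.columns.get_locs(['A', 'y', slice(None)])         # positions of the first slice
locs_b = df.columns.get_locs([['B', 'C'], 'x', slice(None)])  # positions of the second slice
df.iloc[:, np.r_[locs_a, locs_b]]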

Rename column with a name from a list

For example, I have a list of names:
name_list = ['a', 'b', 'c']
and 3 dataframes:
>> df1
>> k l m
0 12 13 14
1 13 14 15
>> df2
>> o p q
0 10 11 12
1 15 16 17
>> df3
>> r s t
0 1 3 4
1 3 4 5
What I want to do is rename the first column of each dataframe with the corresponding name from name_list. So a will replace k, b will replace o, and c will replace r.
the output will be:
>> df1
>> a l m
0 12 13 14
1 13 14 15
>> df2
>> b p q
0 10 11 12
1 15 16 17
>> df3
>> c s t
0 1 3 4
1 3 4 5
I can do it manually, but it would be better if there is a cleaner method. Thanks.
I totally agree with @ALollz, but nevertheless you can try something like:
df1 = pd.DataFrame([[1, 2, 3]], columns=['k', 'l', 'm'])
df2 = pd.DataFrame([[1, 2, 3]], columns=['o', 'p', 'q'])
df3 = pd.DataFrame([[1, 2, 3]], columns=['r', 's', 't'])
name_list = ['a', 'b', 'c']

for index, name in enumerate(name_list, 1):
    df = pd.eval('df{index}'.format(index=index))
    df.rename(columns={df.columns[0]: name}, inplace=True)
If you have the dataframes in a list like dfs = [df1, df2, df3] then you can do:
dfs = [dfs[i].rename(columns={dfs[i].columns[0]: name_list[i]}) for i in range(len(dfs))]
You can do it in place:
[df.rename(columns={df.columns[0]: c}, inplace=True)
 for df, c in zip([df1, df2, df3], ['a', 'b', 'c'])]
Alternatively:
for df, c in zip([df1, df2, df3], ['a', 'b', 'c']):
    df.rename(columns={df.columns[0]: c}, inplace=True)

Get last observations from Pandas

Assuming the following dataframe:
variable value
0 A 12
1 A 11
2 B 4
3 A 2
4 B 1
5 B 4
I want to extract the last observation for each variable. In this case, it would give me:
variable value
3 A 2
5 B 4
How would you do this in the most panda/pythonic way?
I'm not worried about performance. Clarity and conciseness are important.
The best way I came up with:
df = pd.DataFrame({'variable': ['A', 'A', 'B', 'A', 'B', 'B'], 'value': [12, 11, 4, 2, 1, 4]})
variables = df['variable'].unique()
new_df = df.drop(index=df.index)  # empty frame with the same columns
for v in variables:
    # DataFrame.append was removed in pandas 2.0; concat does the same job
    new_df = pd.concat([new_df, df[df['variable'] == v].tail(1)])
Use drop_duplicates:
new_df = df.drop_duplicates('variable', keep='last')
Out[357]:
variable value
3 A 2
5 B 4
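As an aside (my suggestion, not from the answer above): groupby(...).tail(1) returns the same rows and reads as a literal statement of "last observation per group":
df.groupby('variable').tail(1)   # keeps rows 3 and 5 with their original index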

how to reorder rows of a dataframe based on values in a column

I have a dataframe like this:
A B C D
b 3 3 4
a 1 2 1
a 1 2 1
d 4 4 1
d 1 2 1
c 4 5 6
Now I hope to reorder the rows based on values in column A.
I don't want to sort the values, but reorder them in a specific order like ['b', 'd', 'c', 'a'].
what I expect is:
A B C D
b 3 3 4
d 4 4 1
d 1 2 1
c 4 5 6
a 1 2 1
a 1 2 1
This is a good use case for pd.Categorical, since you have ordered categories. Just make that column a categorical and mark ordered=True. Then, sort_values should do the rest.
df['A'] = pd.Categorical(df.A, categories=['b', 'd', 'c', 'a'], ordered=True)
df.sort_values('A')
If you want to keep your column as is, you can just use loc and the indexes.
df.loc[pd.Series(pd.Categorical(df.A,
                                categories=['b', 'd', 'c', 'a'],
                                ordered=True))
       .sort_values()
       .index]
Use a dictionary to map the strings to their order, then sort the values and reindex:
order = ['b', 'd', 'c', 'a']
df = df.reindex(df['A'].map(dict(zip(order, range(len(order))))).sort_values().index)
print(df)
A B C D
0 b 3 3 4
3 d 4 4 1
4 d 1 2 1
5 c 4 5 6
1 a 1 2 1
2 a 1 2 1
Without changing the datatype of A, you can set 'A' as the index and select rows in the desired order defined by sk:
sk = ['b', 'd', 'c', 'a']
df.set_index('A').loc[sk].reset_index()
Or use a temp column for sorting:
sk = ['b', 'd', 'c', 'a']
(
    df.assign(S=df.A.map({v: k for k, v in enumerate(sk)}))
      .sort_values(by='S')
      .drop('S', axis=1)
)
I'm taking the solution provided by rafaelc a step further. If you want to do it in a chained process, here is how you'd do it:
df = (
    df.assign(A=lambda x: pd.Categorical(x['A'], categories=['b', 'd', 'c', 'a'], ordered=True))
      .sort_values('A')
)

row values from range based on another row in Python

My two sample dfs are below.
df1
Column1
1
2
3
4
5
6
7
8
9
10
11
12
13
df2
Column1 Column2
1 A
2 B
3 C
4 D
5 E
6 F
7 G
8 H
9 I
10 J
What I want is to merge the two dfs on df1, which is quite simple. But if the value is not found in df2, I want to look it up in a range.
e.g. if the merge yields NaN, it should check whether the value is between 11 and 13 and return "C"; if it is between 14 and 18, return "D"; and if it is between 19 and 25, return "E".
You need to use merge and then fill the NaNs with fillna().
df1 = pd.DataFrame({'Column1': range(1, 26)})
df2 = pd.DataFrame({'Column1': range(1, 11),
                    'Column2': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']})
df1 = df1.merge(df2, on=['Column1'], how='left')
fill_dict = {11: 'C', 12: 'C', 13: 'C',
             14: 'D', 15: 'D', 16: 'D', 17: 'D', 18: 'D',
             19: 'E', 20: 'E', 21: 'E', 22: 'E', 23: 'E', 24: 'E', 25: 'E'}
df1['Column2'] = df1['Column2'].fillna(df1['Column1'].map(fill_dict))
print(df1)
Output:
Column1 Column2
0 1 A
1 2 B
2 3 C
3 4 D
4 5 E
5 6 F
6 7 G
7 8 H
8 9 I
9 10 J
10 11 C
11 12 C
12 13 C
13 14 D
14 15 D
15 16 D
16 17 D
17 18 D
18 19 E
19 20 E
20 21 E
21 22 E
22 23 E
23 24 E
24 25 E
EDIT 1:
If you have a range for creating the fill_dict dictionary, you can use dict.fromkeys():
fill_dict = dict.fromkeys(range(11,14),'C')
fill_dict.update(dict.fromkeys(range(14,19),'D'))
fill_dict.update(dict.fromkeys(range(19,26),'E'))
Or you can use a list comprehension to create fill_dict:
fill_dict = dict([(i, 'C') for i in range(11, 14)] +
[(i, 'D') for i in range(14, 19)] +
[(i, 'E') for i in range(19, 26)])
EDIT 2:
Based on our chat conversation, can you please try this:
Instead of creating the dict with a range of ints, since your data has float values, I thought of using np.arange(), but identifying the correct key with the right decimal precision was a bit problematic. So I wrote a function to generate the keys. I am sure this is not efficient in terms of performance, but it gets the job done; there should be a more effective solution.
import pandas as pd
import decimal

def gen_float_range(start, stop, step):
    # convert via str so Decimal doesn't inherit the float's binary imprecision
    step = decimal.Decimal(str(step))
    while start < stop:
        yield float(start)
        start += step
base1 = pd.DataFrame({'HS CODE': [5004.0000, 5005.0000, 5006.0000, 5007.1000, 5007.2000,
                                  6115.950, 6115.950, 6115.960, 6115.960, 6115.950]})
base2 = pd.DataFrame({'HS CODE': [5004.0000, 5005.0000, 5006.0000, 5007.1000, 5007.2000],
                      '%age': 0.4})
base1 = base1.merge(base2, on=['HS CODE'], how='left')
fill_dict = dict.fromkeys(list(gen_float_range(6110, 6121, 0.0001)), '0.06')
# base1['%age'] = base1.replace({'HS CODE': fill_dict})
base1['%age'] = base1['%age'].fillna(base1['HS CODE'].map(fill_dict))
print(base1)
Output:
HS CODE %age
0 5004.00 0.4
1 5005.00 0.4
2 5006.00 0.4
3 5007.10 0.4
4 5007.20 0.4
5 6115.95 0.06
6 6115.95 0.06
7 6115.96 0.06
8 6115.96 0.06
9 6115.95 0.06
You have to create the fill_dict from the different ranges, extending it with each start/stop pair; the step should be whatever increment your data uses. Based on the data you shared, I assumed the step would be 0.0001, but that makes the dict very large. You can look at reducing the step to 0.1 or 0.01, based on your requirement.
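As an aside (my suggestion, not part of the original answer): when the lookup is really "which range does this value fall in", pd.cut avoids enumerating every key entirely. A minimal sketch on made-up data:
import pandas as pd

base1 = pd.DataFrame({'HS CODE': [5004.0, 5007.1, 6115.95, 6115.96]})
base2 = pd.DataFrame({'HS CODE': [5004.0, 5007.1], '%age': 0.4})
base1 = base1.merge(base2, on='HS CODE', how='left')

# a single bin (6110, 6121] mapped to 0.06; codes outside any bin stay NaN
in_range = pd.cut(base1['HS CODE'], bins=[6110, 6121], labels=[0.06])
base1['%age'] = base1['%age'].fillna(in_range.astype(float))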
Merge with a left join, and then refill accordingly.
UPDATE:
df1 = df1.merge(df2, on=['Column1'], how='left')
fill_dict = {11: 'C', 12: 'C', ...}
df1['Column2'] = df1['Column2'].fillna(df1['Column1'].map(fill_dict))
