python pandas select both head and tail - python

For a DataFrame in Pandas, how can I select both the first 5 values and last 5 values?
For example
In [11]: df
Out[11]:
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-01 2 2 2
2012-12-02 3 3 3
2012-12-03 4 4 4
2012-12-04 5 5 5
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 8
2012-12-08 9 9 9
How to show the first two and the last two rows?

You can use iloc with numpy.r_:
print (np.r_[0:2, -2:0])
[ 0 1 -2 -1]
df1 = df.iloc[np.r_[0:2, -2:0]]
print (df1)
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-07 8 8 8
2012-12-08 9 9 9
df2 = df.iloc[np.r_[0:4, -4:0]]
print (df2)
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-01 2 2 2
2012-12-02 3 3 3
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 8
2012-12-08 9 9 9
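One thing to watch with the np.r_ approach: if the frame has fewer than twice the requested number of rows, the head and tail positions overlap and rows are repeated. A minimal sketch of a guard, assuming a hypothetical helper name head_tail_iloc (it is not part of pandas or of the answer above):

```python
import numpy as np
import pandas as pd

def head_tail_iloc(df, n=2):
    """Select the first and last n rows by position, without repeating rows."""
    size = len(df)
    if size <= 2 * n:
        # head and tail would overlap, so return every row exactly once
        return df
    # explicit end positions avoid relying on negative indices
    return df.iloc[np.r_[0:n, size - n:size]]

df = pd.DataFrame(
    {"A": range(10), "B": range(10), "C": range(10)},
    index=pd.date_range("2012-11-29", periods=10),
)
result = head_tail_iloc(df, 2)
short = head_tail_iloc(df.iloc[:3], 2)
```

Here result holds rows 0, 1, 8 and 9 of the example frame, while short is the three-row frame unchanged.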

You can use df.head(5) and df.tail(5) to get the first five and last five rows.
Optionally, you can combine the head and tail into a new data frame with pd.concat (DataFrame.append was removed in pandas 2.0):
new_df = pd.concat([df.head(5), df.tail(5)])

Not quite the same question, but if you just want to show the top and bottom 5 rows (e.g. with display in Jupyter or a regular print), there is potentially a simpler way: use the pd.option_context context manager.
# make 100 random 3-d points
df = pd.DataFrame(np.random.randn(100,3))
# select rows via the index of the row sums (this keeps the original order; it does not sort)
df = df.loc[df.sum(axis=1).index]
with pd.option_context('display.max_rows',10):
    print(df)
Outputs:
0 1 2
0 -0.649105 -0.413335 0.374872
1 3.390490 0.552708 -1.723864
2 -0.781308 -0.277342 -0.903127
3 0.433665 -1.125215 -0.290228
4 -2.028750 -0.083870 -0.094274
.. ... ... ...
95 0.443618 -1.473138 1.132161
96 -1.370215 -0.196425 -0.528401
97 1.062717 -0.997204 -1.666953
98 1.303512 0.699318 -0.863577
99 -0.109340 -1.330882 -1.455040
[100 rows x 3 columns]
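One caveat, as far as I know: since pandas 0.25 the truncated view is also capped by display.min_rows (default 10), so raising display.max_rows above 10 on its own still shows only 10 rows. A sketch that sets both options together; preview is a hypothetical helper name, not a pandas function:

```python
import numpy as np
import pandas as pd

def preview(df, n=5):
    """Return the text repr of df truncated to its first and last n rows."""
    # set both options so the truncated view really contains 2 * n rows
    with pd.option_context("display.max_rows", 2 * n, "display.min_rows", 2 * n):
        return repr(df)

df = pd.DataFrame(np.random.randn(100, 3))
text = preview(df, 8)
print(text)
```

The options are restored when the with block exits, so the global display settings are left untouched.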

Small simple function:
def ends(df, x=5):
    return pd.concat([df.head(x), df.tail(x)])
And use like so:
df = pd.DataFrame(np.random.rand(15,6))
ends(df,2)
I use this so often that I think it would be a great feature to add to pandas. (No new features are being added to the pandas.DataFrame core API, though.) I add it after the import like so:
import pandas as pd
def ends(df, x=5):
    return pd.concat([df.head(x), df.tail(x)])
setattr(pd.DataFrame,'ends',ends)
Use like so:
import numpy as np
df = pd.DataFrame(np.random.rand(15,6))
df.ends(2)
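If patching pd.DataFrame feels too invasive, DataFrame.pipe gives a similar method-chaining feel without touching the class. A small sketch, assuming the same ends helper (rewritten here with pd.concat so the block runs on its own):

```python
import numpy as np
import pandas as pd

def ends(df, x=5):
    # pd.concat stands in for DataFrame.append, which was removed in pandas 2.0
    return pd.concat([df.head(x), df.tail(x)])

df = pd.DataFrame(np.random.rand(15, 6))
# df.pipe(ends, 2) calls ends(df, 2), so it fits inside a method chain
both = df.pipe(ends, 2)
```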

You should use both head() and tail() for this purpose. I think the easiest way to do this is:
pd.concat([df.head(5), df.tail(5)])

In Jupyter, expanding on @bolster's answer, we'll create a reusable convenience function:
def display_n(df,n):
    with pd.option_context('display.max_rows',n*2):
        display(df)
Then
display_n(df,2)
Returns
0 1 2
0 0.167961 -0.732745 0.952637
1 -0.050742 -0.421239 0.444715
... ... ... ...
98 0.085264 0.982093 -0.509356
99 -0.758963 -0.578267 -0.115865
(except as a nicely formatted HTML table)
when df is df = pd.DataFrame(np.random.randn(100,3))
Notes:
Of course, you could print the same thing as text by changing display to print above.
On Unix-like systems, you can autoload the above function in all notebooks by placing it in a .py or .ipy file in ~/.ipython/profile_default/startup, as described here.

If you want to keep it to just Pandas, you can use apply() to concatenate the head and tail:
import pandas as pd
from string import ascii_lowercase, ascii_uppercase
df = pd.DataFrame(
    {"upper": list(ascii_uppercase), "lower": list(ascii_lowercase)}, index=range(1, 27)
)
df.apply(lambda x: pd.concat([x.head(2), x.tail(2)]))
upper lower
1 A a
2 B b
25 Y y
26 Z z

Related to Linas Fx's answer.
Define the following:
pd.DataFrame.less = lambda df, n=10: pd.concat([df.head(n//2), df.tail(n//2)])
Then you only need to type df.less().
It is the same as typing pd.concat([df.head(), df.tail()]).
If you type df.less(2), the result is the same as pd.concat([df.head(1), df.tail(1)]).

Combining @ic_fl2 and @watsonic to give the below in Jupyter:
def ends_attr():
    def display_n(df,n):
        with pd.option_context('display.max_rows',n*2):
            display(df)
    # set pd.DataFrame attribute where .ends runs display_n() function
    setattr(pd.DataFrame,'ends',display_n)
ends_attr()
View first and last 3 rows of your df:
your_df.ends(3)
I like this because I can copy a single function and know I have everything I need to use the ends attribute.

Related

Add label/multi-index on top of columns

Context: I'd like to add a new multi-index/row on top of the columns. For example if I have this dataframe:
tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
Desired Output: How could I make it so that I can add "Table X" on top of the columns A,B, and C?
Table X
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Possible solutions(?): I was thinking about transposing the dataframe, adding the multi-index, and transpose it back again, but not sure how to do that without having to write the dataframe columns manually (I've checked other SO posts about this as well)
Thank you!
In the meantime I've also discovered this solution:
tt = pd.concat([tt],keys=['Table X'], axis=1)
Which also yields the desired output
Table X
A B C
0 1 4 7
1 2 5 8
2 3 6 9
If you want a data frame like you wrote, you need a Multiindex data frame, try this:
import pandas as pd
# you need a nested dict first
dict_nested = {'Table X': {'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]}}
# then you have to reform it
reformed_dict = {}
for outer_key, inner_dict in dict_nested.items():
    for inner_key, values in inner_dict.items():
        reformed_dict[(outer_key, inner_key)] = values
# last but not least convert it to a multiindex dataframe
multiindex_df = pd.DataFrame(reformed_dict)
print(multiindex_df)
# >> Table X
# >> A B C
# >> 0 1 4 7
# >> 1 2 5 8
# >> 2 3 6 9
You can use pd.MultiIndex.from_tuples() to set / change the columns of the dataframe with a multi index:
tt.columns = pd.MultiIndex.from_tuples((
    ('Table X', 'A'), ('Table X', 'B'), ('Table X', 'C')))
Result (tt):
Table X
A B C
0 1 4 7
1 2 5 8
2 3 6 9
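Add-on: since those are multi-index levels, you can change them later. set_levels returns a new index (its inplace argument was removed in pandas 2.0), so assign the result back:
tt.columns = tt.columns.set_levels(['table_x'],level=0)
tt.columns = tt.columns.set_levels(['a','b','c'],level=1)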
table_x
a b c
0 1 4 7
1 2 5 8
2 3 6 9
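To avoid writing out the existing column names by hand, which was the concern in the question, the top level can also be built from tt.columns directly. A minimal sketch using pd.MultiIndex.from_product; the variable name labeled is only illustrative:

```python
import pandas as pd

tt = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

labeled = tt.copy()
# pair the single top-level label with every existing column name
labeled.columns = pd.MultiIndex.from_product([["Table X"], tt.columns])
print(labeled)
```

This gives the same column layout as the pd.concat(..., keys=['Table X'], axis=1) solution, without listing A, B and C explicitly.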

Simple Vectorized Math on Multilevel Columns by Level=0 as Group

I have this data:
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_tuples(list(zip(*[['one', 'one', 'two', 'two'],['foo', 'bar', 'foo', 'bar']])))
df = pd.DataFrame(np.arange(12).reshape((3,4)), columns=index)
one two
foo bar foo bar
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Is there a way to do simple vectorized calculations (like addition) for each level 0 group columns on each of the level 1 columns without having to reference the specific column level pairs like:
df[('one','add')] = df[('one','foo')]+df[('one','bar')]
I'd like to get
one two
foo bar add foo bar add
0 0 1 1 2 3 5
1 4 5 9 6 7 13
2 8 9 17 10 11 21
I fiddled around with it for a bit and here is a one-liner that solves the problem in my opinion. It's fully vectorized and doesn't address specific column names. It also puts the add column in the right place.
df.stack(0).assign(add=df.stack(0).sum(axis=1)).stack(0).unstack(0).T
Unfortunately, because of the property of stack / unstack to do the stacking / unstacking into the innermost level, it needs the cryptic .stack(0).unstack(0) operation. It seems like those two operations should cancel each other out, but they actually shuffle the index levels while preserving order.
Here is the same thing split into 3 lines without assign statement.
df = df.stack(0)
df['add'] = df.sum(axis=1)
df = df.stack(0).unstack(0).T
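Sum the columns within each level-0 group. Older pandas accepted df.sum(axis=1, level=0), but the level argument was removed in pandas 2.0, so group the transposed frame instead:
df2 = df.T.groupby(level=0).sum().T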
print(df2)
Output:
one two
0 1 5
1 9 13
2 17 21
You can then rename the new columns and join them to the original frame with pandas.concat:
df2.columns = [(c, "add") for c in df2]
df2 = pd.concat([df, df2], axis=1).sort_index(axis=1)
print(df2)
Output:
one two
add bar foo add bar foo
0 1 1 0 5 3 2
1 9 5 4 13 7 6
2 17 9 8 21 11 10
An alternative solution, using the same sum but without pd.concat:
df[("one", "add")] = None
df[("two", "add")] = None
# note: the level argument of sum() was removed in pandas 2.0, so this line needs an older pandas
df.iloc[:, -2:] = df.sum(axis=1, level=0).to_numpy()
df.sort_index(axis=1)
one two
add bar foo add bar foo
0 1.0 1 0 5.0 3 2
1 9.0 5 4 13.0 7 6
2 17.0 9 8 21.0 11 10
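For reference, here is a sketch of the same idea that should run on current pandas and also reproduces the column order asked for in the question (foo, bar, add). The variable names sums, groups and order are illustrative, and the inner order is written out by hand as an assumption about what is wanted:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("one", "foo"), ("one", "bar"), ("two", "foo"), ("two", "bar")]
)
df = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=index)

# sum the columns inside each level-0 group (groups are rows of the transposed frame)
sums = df.T.groupby(level=0).sum().T
sums.columns = pd.MultiIndex.from_product([sums.columns, ["add"]])

# place each "add" column right after its group's original columns
groups = df.columns.get_level_values(0).unique()
order = pd.MultiIndex.from_product([groups, ["foo", "bar", "add"]])
result = pd.concat([df, sums], axis=1).reindex(columns=order)
print(result)
```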

Pandas: remove old DataFrame from memory after groupby

value Group something
0 a 1 1
1 b 1 2
2 c 1 4
3 c 2 9
4 b 2 10
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
I want to select the last 3 rows of each group (from the above df) as shown below, but perform the operation in place. I want to ensure that only the new df object stays in memory after the assignment. What would be an efficient way of doing this?
df = df.groupby('Group').tail(3)
The result should look like the following:
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
N.B:- This question is related to Keeping the last N duplicates in pandas
df = df.groupby('Group').tail(3) is already an efficient way of doing it. Because you are overwriting the df variable, Python will take care of releasing the memory of the old dataframe, and you will only have access to the new one.
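A small reproduction of the reassignment, with the example frame rebuilt from the table in the question. Rebinding df drops that name's reference to the old frame; the memory can only be reclaimed if nothing else (another variable, or IPython's Out cache, for instance) still refers to it, which is an assumption about your session:

```python
import pandas as pd

df = pd.DataFrame({
    "value": list("abccbxdeda"),
    "Group": [1, 1, 1, 2, 2, 2, 2, 3, 2, 3],
    "something": [1, 2, 4, 9, 10, 5, 3, 5, 10, 5],
})

# after this line the name df points at the new, smaller frame only
df = df.groupby("Group").tail(3)
print(df)
```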
Trying way too hard to guess what you want.
NOTE: using Pandas inplace argument where it is available is NO guarantee that a new DataFrame won't be created in memory. In fact, it may very well create a new DataFrame in memory and replace the old one behind the scenes.
from collections import defaultdict
def f(s):
    c = defaultdict(int)
    for i, x in zip(s.index[::-1], s.values[::-1]):
        c[x] += 1
        if c[x] > 3:
            yield i
df.drop([*f(df.Group)], inplace=True)
df
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
Your question already contains the answer. However, as noted in the comments, you are overwriting the existing df; to avoid that, assign the result to a new variable name like below:
new_df = df.groupby('Group').tail(3)
However, out of curiosity, if you are not concerned about the groupby and only want the last N rows of the df, you can do it like below:
df[-2:] # last 2 rows

Modifying DataFrames in loop

Given this data frame:
import pandas as pd
df=pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
df
A B C
0 1 4 7
1 2 5 8
2 3 6 9
I'd like to create 3 new data frames; one from each column.
I can do this one at a time like this:
a=pd.DataFrame(df[['A']])
a
A
0 1
1 2
2 3
But instead of doing this for each column, I'd like to do it in a loop.
Here's what I've tried:
a=b=c=df.copy()
dfs=[a,b,c]
fields=['A','B','C']
for d,f in zip(dfs,fields):
    d=pd.DataFrame(d[[f]])
...but when I then print each one, I get the whole original data frame as opposed to just the column of interest.
a
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Update:
My actual data frame will have some columns that I do not need and the columns will not be in any sort of order, so I need to be able to get the columns by name.
Thanks in advance!
A simple list comprehension should be enough.
In [68]: df_list = [df[[x]] for x in df.columns]
Printing out the list, this is what you get:
In [69]: for d in df_list:
...: print(d)
...: print('-' * 5)
...:
A
0 1
1 2
2 3
-----
B
0 4
1 5
2 6
-----
C
0 7
1 8
2 9
-----
Each element in df_list is its own data frame, corresponding to one column of the original. Furthermore, you don't even need fields; use df.columns instead.
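Since the update to the question says that only some columns are needed and that they must be picked by name, a dict keyed by column name may be more convenient than a list. A minimal sketch; wanted and frames are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# only the columns of interest, in whatever order is needed
wanted = ["C", "A"]
# double brackets keep each result a one-column DataFrame rather than a Series
frames = {name: df[[name]] for name in wanted}
print(frames["A"])
```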
Or you can try this instead of creating copies of df. This method binds each result to its own variable name rather than collecting the results in a list; however, I think saving the data frames in a list is better.
dfs=['a','b','c']
fields=['A','B','C']
# writing to locals() only creates variables at module or interactive level, not inside a function
variables = locals()
for d,f in zip(dfs,fields):
    variables["{0}".format(d)] = df[[f]]
a
Out[743]:
A
0 1
1 2
2 3
b
Out[744]:
B
0 4
1 5
2 6
c
Out[745]:
C
0 7
1 8
2 9
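You can use iloc to select a column by position (loc is label-based, so df.loc[:, 0] fails when the columns are named 'A', 'B', 'C'):
a = df.iloc[:, [0]]
and then loop through like
dfs = []
for i in range(df.columns.size):
    dfs.append(df.iloc[:, [i]])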

Pandas Extract Number from String

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['1a',np.nan,'10a','100b','0b'],
})
df
A
0 1a
1 NaN
2 10a
3 100b
4 0b
I'd like to extract the numbers from each cell (where they exist).
The desired result is:
A
0 1
1 NaN
2 10
3 100
4 0
I know it can be done with str.extract, but I'm not sure how.
Give it a regex capture group (pass expand=False to get a Series back rather than a one-column DataFrame):
df.A.str.extract(r'(\d+)', expand=False)
Gives you:
0 1
1 NaN
2 10
3 100
4 0
Name: A, dtype: object
You can replace your column with the result using the assign function:
df = df.assign(A = lambda x: x['A'].str.extract(r'(\d+)', expand=False))
To answer @Steven G's question in the comment above, this should work:
df.A.str.extract(r'(^\d*)')
If you have cases where you have multiple disjoint sets of digits, as in 1a2b3c, in which you would like to extract 123, you can do it with Series.str.replace:
>>> df
A
0 1a
1 b2
2 a1b2
3 1a2b3c
>>> df['A'] = df['A'].str.replace(r'\D+', '', regex=True)
0 1
1 2
2 12
3 123
You could also work this around with Series.str.extractall and groupby but I think that this one is easier.
Hope this helps!
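For completeness, the Series.str.extractall and groupby route mentioned above could look roughly like this. It is a sketch on a made-up Series s (with one NaN added to show how missing values behave), not the answerer's code:

```python
import numpy as np
import pandas as pd

s = pd.Series(["1a", "b2", "a1b2", "1a2b3c", np.nan])

# one row per match, indexed by (original row, match number)
matches = s.str.extractall(r"(\d+)")[0]
# join the matches of each original row back into a single string
joined = matches.groupby(level=0).agg("".join)
# rows without any digits are absent from joined; reindex restores them as NaN
result = joined.reindex(s.index)
print(result)
```

Compared with str.replace, this keeps rows that contain no digits as NaN instead of turning them into empty strings.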
