How can I iterate over pairs of rows of a Pandas DataFrame?
For example:
content = [(1,2,[1,3]),(3,4,[2,4]),(5,6,[6,9]),(7,8,[9,10])]
df = pd.DataFrame(content, columns=["a", "b", "interval"])
print(df)
output:
a b interval
0 1 2 [1, 3]
1 3 4 [2, 4]
2 5 6 [6, 9]
3 7 8 [9, 10]
Now I would like to do something like
for (indx1,row1), (indx2,row2) in df.?
    print("row1:\n", row1)
    print("row2:\n", row2)
    print("\n")
which should output
row1:
a 1
b 2
interval [1,3]
Name: 0, dtype: int64
row2:
a 3
b 4
interval [2,4]
Name: 1, dtype: int64
row1:
a 3
b 4
interval [2,4]
Name: 1, dtype: int64
row2:
a 5
b 6
interval [6,9]
Name: 2, dtype: int64
row1:
a 5
b 6
interval [6,9]
Name: 2, dtype: int64
row2:
a 7
b 8
interval [9,10]
Name: 3, dtype: int64
Is there a built-in way to achieve this?
I looked at df.groupby(df.index // 2) and df.itertuples but none of these methods seems to do what I want.
Edit:
The overall goal is to get a list of bools indicating whether the intervals in column "interval" overlap. In the above example the list would be
overlaps = [True, False, False]
So one bool for each pair.
Shift the dataframe and concat it back to the original with axis=1, so that each interval and the next interval end up in the same row:
df_merged = pd.concat([df, df.shift(-1).add_prefix('next_')], axis=1)
df_merged
#Out:
a b interval next_a next_b next_interval
0 1 2 [1, 3] 3.0 4.0 [2, 4]
1 3 4 [2, 4] 5.0 6.0 [6, 9]
2 5 6 [6, 9] 7.0 8.0 [9, 10]
3 7 8 [9, 10] NaN NaN NaN
Then define an intersects function that works with your list representation, and apply it on the merged dataframe, ignoring the last row where next_interval is null:
def intersects(left, right):
    # assumes rows are ordered by interval start
    return left[1] > right[0]
df_merged[:-1].apply(lambda x: intersects(x.interval, x.next_interval), axis=1)
#Out:
0 True
1 False
2 False
dtype: bool
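If you only need the final list of bools rather than the merged frame, a plain list comprehension over consecutive intervals works too. This is a small sketch; it uses a symmetric overlap test with strict comparisons, so touching intervals like [6, 9] and [9, 10] count as non-overlapping, matching the expected output:
def intersects(left, right):
    # strict inequalities: a shared endpoint does not count as overlap
    return left[0] < right[1] and right[0] < left[1]

intervals = df["interval"].tolist()
overlaps = [intersects(a, b) for a, b in zip(intervals, intervals[1:])]
# [True, False, False]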
If you want to keep the for loop, combining zip and iterrows is one way:
for (indx1, row1), (indx2, row2) in zip(df[:-1].iterrows(), df[1:].iterrows()):
    print("row1:\n", row1)
    print("row2:\n", row2)
    print("\n")
To access the next row at the same time, start the second iterrows one row later with df[1:].iterrows(), and you get the output the way you want:
row1:
a 1
b 2
Name: 0, dtype: int64
row2:
a 3
b 4
Name: 1, dtype: int64
row1:
a 3
b 4
Name: 1, dtype: int64
row2:
a 5
b 6
Name: 2, dtype: int64
row1:
a 5
b 6
Name: 2, dtype: int64
row2:
a 7
b 8
Name: 3, dtype: int64
But as @RafaelC said, a for loop might not be the best method for your general problem.
To get the output you've shown, use:
for row in df.index[:-1]:
    print('row 1:')
    print(df.iloc[row].squeeze())
    print('row 2:')
    print(df.iloc[row + 1].squeeze())
    print()
You could try iloc indexing. Example:
for i in range(df.shape[0] - 1):
    idx1, idx2 = i, i + 1
    row1, row2 = df.iloc[idx1], df.iloc[idx2]
    print(row1)
    print(row2)
    print()
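On Python 3.10+, itertools.pairwise gives the same consecutive pairing without manual slicing; a minimal sketch:
from itertools import pairwise  # Python 3.10+

for (indx1, row1), (indx2, row2) in pairwise(df.iterrows()):
    print("row1:\n", row1)
    print("row2:\n", row2)
    print()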
Related
Python version: 3.5.2; Pandas version: 0.23.1
I am noticing unexpected behavior when I group by two index levels where each row is unique on the first level. The code I am executing on my dataframe with column c is:
df.c.groupby(df.index.names).min()
Everything works as expected when the rows are not unique on the first index. To make this clear, I've included two versions below. Edit: Now including three versions!
Version 1: Has the expected output
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [1, 2, 4]], columns=['a', 'b', 'c'])
df = df.set_index(['a','b']).sort_index()
Input:
c
a b
1 2 3
2 4
4 5 6
Output:
a b
1 2 3
4 5 6
Version 2: Has the unexpected output
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
df = df.set_index(['a','b']).sort_index()
Input:
c
a b
1 2 3
4 5 6
Output:
a 3
b 6
Expected Output:
a b
1 2 3
4 5 6
Version 3: Has the expected output, which is surprising given version 2's behavior.
df = pd.DataFrame([[1, 2, 3, 4], [4, 5, 6, 7]], columns=['a', 'b1', 'b2', 'c'])
df = df.set_index(['a','b1','b2']).sort_index()
Input:
c
a b1 b2
1 2 3 4
4 5 6 7
Output:
a b1 b2
1 2 3 4
4 5 6 7
Here is a peek into what is going on. Take a look at the name of the series that gets passed into the "applied" function, f.
In the first case (Expected Results):
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [1, 2, 4]], columns=['a', 'b', 'c'])
df = df.set_index(['a','b']).sort_index()
def f(x):
    print(x)
    print('\n')
    print(min(x))
    print('\n')
    return min(x)
df.c.groupby(['a','b']).apply(f)
Output:
a b
1 2 3
2 4
Name: (1, 2), dtype: int64
3
a b
4 5 6
Name: (4, 5), dtype: int64
6
Out[292]:
a b
1 2 3
4 5 6
In the second case (unexpected results), note the name of the series passed in:
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
df1 = df1.set_index(['a','b']).sort_index()
def f(x):
    print(x)
    print('\n')
    print(min(x))
    print('\n')
    return min(x)
df1.c.groupby(['a','b']).apply(f)
Output:
a b
1 2 3
Name: a, dtype: int64
3
a b
4 5 6
Name: b, dtype: int64
6
Out[293]:
a 3
b 6
Name: c, dtype: int64
It uses these series to build the resulting dataframe. The naming of the series is the culprit, due to the nature of the data. Why? Well, we'd have to look into the code for that.
The idiomatic fix for this problem is to use this syntax:
df1.groupby(df1.index.names)['c'].min()
Output:
a b
1 2 3
4 5 6
Name: c, dtype: int64
You can use the level argument of groupby:
>>> df
c
a b
1 2 3
4 5 6
>>> df.c.groupby(level=[0,1]).min()
a b
1 2 3
4 5 6
Name: c, dtype: int64
From the docs:
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels
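Per that same doc entry, level also accepts the index level names, which can read more clearly than positions:
>>> df.c.groupby(level=['a', 'b']).min()
a  b
1  2    3
4  5    6
Name: c, dtype: int64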
This behavior has since been fixed in pandas; the output now matches the expected output in all cases.
I have the following df:
A C
Date
2015-06-29 196.0 1
2015-09-18 255.0 2
2015-08-24 236.0 3
2014-11-20 39.0 4
2014-10-02 4.0 5
How can I generate a new series where each value is the sum of column C up to and including that row?
This would be the desired output:
D
1
3     # sum of the first and second rows of column C
6     # sum of the first, second and third rows, and so on
10
15
I have tried a loop such as:
for j in range(len(df)):
    new_series.iloc[j] += df['C'].iloc[j]
return new_series
But it does not seem to work.
IIUC you can use cumsum to perform this:
In [373]:
df['C'].cumsum()
Out[373]:
Date
2015-06-29 1
2015-09-18 3
2015-08-24 6
2014-11-20 10
2014-10-02 15
Name: C, dtype: int64
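To store the result as the new column D from the question, assign it back:
df['D'] = df['C'].cumsum()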
Numpy alternatives:
In [207]: np.add.accumulate(df['C'])
Out[207]:
2015-06-29 1
2015-09-18 3
2015-08-24 6
2014-11-20 10
2014-10-02 15
Name: C, dtype: int64
In [208]: np.cumsum(df['C'])
Out[208]:
2015-06-29 1
2015-09-18 3
2015-08-24 6
2014-11-20 10
2014-10-02 15
Name: C, dtype: int64
In [209]: df['C'].values.cumsum()
Out[209]: array([ 1, 3, 6, 10, 15], dtype=int64)
I have a list of possible integers:
item_list = [0,1,2,3]
and some of these numbers will not necessarily appear in my dataframe. For example, with:
df = pd.DataFrame({'a': [0, 2, 0, 1, 0, 1, 0]})
executing
df['a'].value_counts()
will yield
0 5
1 2
2 1
Name: a, dtype: int64
but I am interested in the occurrences of every value in item_list = [0,1,2,3], so basically I would like to see something like:
0 5
1 2
2 1
3 0
Name: a, dtype: int64
where the first column contains all the values from item_list. How can I get this result?
You can also use reindex:
df['a'].value_counts().reindex(item_list).fillna(0)
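Note that reindex introduces NaN for the missing items, which forces a float dtype, so fillna(0) leaves float counts; chain astype(int) if you want integers back:
df['a'].value_counts().reindex(item_list).fillna(0).astype(int)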
You can convert the values to Categorical with the categories fixed up front. In modern pandas the categories are passed via CategoricalDtype (the older astype('category', categories=...) signature has been removed):
item_list = [0,1,2,3]
df.a = df.a.astype(pd.CategoricalDtype(categories=item_list))
print (df['a'].value_counts())
0 5
1 2
2 1
3 0
Name: a, dtype: int64
With reindex and parameter fill_value:
print (df['a'].value_counts().reindex(item_list, fill_value=0))
0 5
1 2
2 1
3 0
Name: a, dtype: int64
I have a dataframe df that has thousands of rows.
For each row I want to apply a function func.
As a test, I wanted to run func for only the first row of df. I placed a print statement in func() and realized it was run 2 times, even though I am slicing df down to a single row (the extra line in the printed output is just the column header, not another data row).
When I do the following
df[0:1].apply(func, axis=1, args=(x, y, z))
or
df.iloc[0:1, :].apply(func, axis=1, args=(x, y, z))
The print statement is run 2 times, which means func() was executed twice.
Any idea why this is happening?
The doc clearly says:
In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path.
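A minimal way to observe this on an affected version (the 0.23.x era) is a func with a visible side effect; even though the frame has a single row, the message prints twice:
import pandas as pd

df1 = pd.DataFrame({'a': [1], 'b': [2]})

def func(row):
    print('func called for row', row.name)
    return row.sum()

df1.apply(func, axis=1)
# On affected pandas versions, 'func called for row 0' prints twice.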
Pay attention to the different slicing techniques:
In [134]: df
Out[134]:
a b c
0 9 5 4
1 4 7 2
2 1 3 7
3 6 3 2
4 4 5 2
In [135]: df.iloc[0:1]
Out[135]:
a b c
0 9 5 4
In [136]: df.loc[0:1]
Out[136]:
a b c
0 9 5 4
1 4 7 2
with printing:
Print one row as a Series:
In [139]: df[0:1].apply(lambda r: print(r), axis=1)
a 9
b 5
c 4
Name: 0, dtype: int32
Out[139]:
0 None
dtype: object
or using iloc:
In [144]: df.iloc[0:1, :].apply(lambda r: print(r), axis=1)
a 9
b 5
c 4
Name: 0, dtype: int32
Out[144]:
0 None
dtype: object
Print two rows/Series:
In [140]: df.loc[0:1].apply(lambda r: print(r), axis=1)
a 9
b 5
c 4
Name: 0, dtype: int32
a 4
b 7
c 2
Name: 1, dtype: int32
Out[140]:
0 None
1 None
dtype: object
OP:
"the print statement was run 2 times even though I am slicing df to
one row"
Actually, you were slicing it into two rows.
I tried:
x = pandas.DataFrame(...)
s = x.take([0], axis=1)
And s gets a DataFrame, not a Series.
>>> import pandas as pd
>>> df = pd.DataFrame({'x' : [1, 2, 3, 4], 'y' : [4, 5, 6, 7]})
>>> df
x y
0 1 4
1 2 5
2 3 6
3 4 7
>>> s = df.ix[:,0]
>>> type(s)
<class 'pandas.core.series.Series'>
>>>
===========================================================================
UPDATE
If you're reading this after June 2017, ix has been deprecated in pandas 0.20.2, so don't use it. Use loc or iloc instead. See comments and other answers to this question.
From v0.11+, ... use df.iloc.
In [7]: df.iloc[:,0]
Out[7]:
0 1
1 2
2 3
3 4
Name: x, dtype: int64
You can get the first column as a Series with the following code:
x[x.columns[0]]
Isn't this the simplest way?
By column name:
In [20]: df = pd.DataFrame({'x' : [1, 2, 3, 4], 'y' : [4, 5, 6, 7]})
In [21]: df
Out[21]:
x y
0 1 4
1 2 5
2 3 6
3 4 7
In [23]: df.x
Out[23]:
0 1
1 2
2 3
3 4
Name: x, dtype: int64
In [24]: type(df.x)
Out[24]:
pandas.core.series.Series
This works great when you want to load a series from a csv file:
x = pd.read_csv('x.csv', index_col=False, names=['x'], header=None).iloc[:, 0]
print(type(x))
print(x.head(10))
<class 'pandas.core.series.Series'>
0 110.96
1 119.40
2 135.89
3 152.32
4 192.91
5 177.20
6 181.16
7 177.30
8 200.13
9 235.41
Name: x, dtype: float64
df[df.columns[i]]
where i is the position/number of the column (starting from 0), so i = 0 gives the first column. You can also get the last column with i = -1.
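For example, with the df from above:
df[df.columns[0]]    # first column as a Series
df[df.columns[-1]]   # last column as a Series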