Pandas: building a column with self-referencing past values - python

I need to generate a column that starts with an initial value and is then generated by a function of past values of that column. For example:
df = pd.DataFrame({'a': [1,1,5,2,7,8,16,16,16]})
df['b'] = 0
df.loc[0, 'b'] = 1
df
a b
0 1 1
1 1 0
2 5 0
3 2 0
4 7 0
5 8 0
6 16 0
7 16 0
8 16 0
Now, I want to generate the rest of the column 'b' by taking the minimum of the previous row and adding two. One solution would be
for i in range(1, len(df)):
    df.loc[i, 'b'] = df.loc[i-1, :].min() + 2
Resulting in the desired output
a b
0 1 1
1 1 3
2 5 3
3 2 5
4 7 4
5 8 6
6 16 8
7 16 10
8 16 12
Does pandas have a 'clean' way to do this? Preferably one that would vectorize the computation?

pandas doesn't have a great way to handle general recursive calculations. There may be some trick to vectorize it, but if you can take the dependency, this is relatively painless and very fast with numba.
import numba
import numpy as np
@numba.njit
def make_b(a):
    # b[i] depends on b[i-1], so the loop has to run in order
    b = np.zeros_like(a)
    b[0] = 1
    for i in range(1, len(a)):
        b[i] = min(b[i-1], a[i-1]) + 2
    return b
df['b'] = make_b(df['a'].values)
df
Out[73]:
a b
0 1 1
1 1 3
2 5 3
3 2 5
4 7 4
5 8 6
6 16 8
7 16 10
8 16 12
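If numba is not an option, the same recurrence can be sketched with the standard library. This is only an illustration, not a vectorized solution: itertools.accumulate still loops at Python speed, and the initial= argument assumes Python 3.8 or later. The variable name previous_a is made up for this example.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 5, 2, 7, 8, 16, 16, 16]})

# Each step combines the previous b with the previous a, so the last
# value of a is never consumed and is dropped from the iterable.
previous_a = df['a'].tolist()[:-1]

# initial=1 seeds b[0]; every later value is min(b[i-1], a[i-1]) + 2.
df['b'] = list(
    accumulate(previous_a, lambda prev_b, prev_a: min(prev_b, prev_a) + 2, initial=1)
)

print(df['b'].tolist())
```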

Related

pandas get original dataframe after vertical concatenation

Let us take a sample dataframe
df = pd.DataFrame(np.arange(10).reshape((5,2)))
df
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
and concatenate the two columns into a single column
temp = pd.concat([df[0], df[1]]).to_frame()
temp
0
0 0
1 2
2 4
3 6
4 8
0 1
1 3
2 5
3 7
4 9
What would be the most efficient way to get the original dataframe i.e df from temp?
The following way using groupby works. But is there any more efficient way (like without groupby-apply, pivot) to do this whole task from concatenation (and then doing some operation) and then reverting back to the original dataframe?
pd.DataFrame(temp.groupby(level=0)[0]
.apply(list)
.to_numpy().tolist())
I think we can pivot after assigning a counter column built with cumcount:
check = temp.assign(c=temp.groupby(level=0).cumcount()).pivot(columns='c', values=0)
Out[66]:
c 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
You can use groupby + cumcount to create a sequential counter per level=0 group then append it to the index of the dataframe and use unstack to reshape:
temp.set_index(temp.groupby(level=0).cumcount(), append=True)[0].unstack()
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
You can try this:
In [1267]: temp['g'] = temp.groupby(level=0)[0].cumcount()
In [1273]: temp.pivot(columns='g', values=0)
Out[1279]:
g 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
OR:
In [1281]: temp['g'] = (temp.index == 0).cumsum() - 1
In [1282]: temp.pivot(columns='g', values=0)
Out[1282]:
g 0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
df = pd.DataFrame(np.arange(10).reshape((5,2)))
temp = pd.concat([df[0], df[1]]).to_frame()
duplicated_index = temp.index.duplicated()
pd.concat([temp[~duplicated_index], temp[duplicated_index]], axis=1)
This works for this specific case, but as pointed out in the comments it fails if there is more than one set of duplicate index values, so I don't think it's a better solution.
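As a rough end-to-end check of the cumcount-based answers, here is a minimal sketch of the round trip. It deliberately assumes a three-column frame (not the two-column one from the question), because that is the case where the duplicated-index approach breaks down; the names counter and restored are made up for this example.

```python
import numpy as np
import pandas as pd

# Assumed three-column variant of the sample frame.
df = pd.DataFrame(np.arange(15).reshape((5, 3)))
temp = pd.concat([df[0], df[1], df[2]]).to_frame()

# Number each occurrence of an index label: 0 for the first column's
# rows, 1 for the second column's rows, and so on.
counter = temp.groupby(level=0).cumcount()

# Append the counter to the index and unstack it back into columns.
restored = temp.set_index(counter, append=True)[0].unstack()

print(restored)
```

The counter plays the role of the original column label, which is why the same idea also works with pivot.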

Python: Changing values in a DataFrame

I'm new to python and pandas and I need some ideas. Say I have the following DataFrame:
0 1 2 3 4 5
1 5 5 5 5 5
2 5 5 5 5 5
3 5 5 5 5 5
4 5 5 5 5 5
I want to iterate through each row and change the values of specific columns. Say I wanted to change all of the values in columns (2,3,4) to a 3.
This is what I've tried, am I going down the right path?
for row in df.iterrows():
    for col in range(2, 4):
        df.set_value('row', 'col', 3)
EDIT:
Thanks for the responses. The simple solutions are obvious, but what if I wanted to change the values to this... for example:
0 1 2 3 4 5
1 1 2 3 4 5
2 6 7 8 9 10
3 11 12 13 14 15
4 16 17 18 19 20
If you are using a loop when working with dataframes, you are almost always not on the right track.
For this you can use a vectorized assignment:
df[[2, 3, 4]] = 3
Example:
df = pd.DataFrame({1: [1, 2], 2: [1, 2]})
print(df)
# 1 2
# 0 1 1
# 1 2 2
df[[1, 2]] = 3
print(df)
# 1 2
# 0 3 3
# 1 3 3
You can also assign column by column, by position, with iloc:
df.iloc[:, 1] = 3  # column 2
df.iloc[:,2] = 3
df.iloc[:,3] = 3
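For the follow-up in the EDIT (filling the frame with 1 through 20), one possible sketch is to build the values with NumPy and reshape them to the frame's shape. This assumes the frame has index labels 1 to 4 and column labels 1 to 5, which is one reading of the table in the question; the name filled is made up for this example.

```python
import numpy as np
import pandas as pd

# Assumed layout: rows labelled 1..4, columns labelled 1..5, all fives.
df = pd.DataFrame(5, index=range(1, 5), columns=range(1, 6))

# Vectorized assignment of a constant to selected columns.
df[[2, 3, 4]] = 3

# Running numbers 1..20 laid out row by row in the same shape and labels.
values = np.arange(1, df.size + 1).reshape(df.shape)
filled = pd.DataFrame(values, index=df.index, columns=df.columns)

print(df)
print(filled)
```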

Summing over a DataFrame with two conditions and multiple values

I have a DataFrame x with three columns;
a b c
1 1 10 4
2 5 6 5
3 4 6 5
4 2 11 9
5 1 2 10
... and a Series y of two values;
t
1 3
2 7
Now I'd like to get a DataFrame z with two columns;
t sum_c
1 3 18
2 7 13
... with t from y and sum_c the sum of c from x for all rows where t was larger than a and smaller than b.
Would anybody be able to help me with this?
Here is a possible solution based on the given condition (the expected results listed in your question don't quite line up with that condition):
In[99]: df1
Out[99]:
a b c
0 1 10 4
1 5 6 5
2 4 6 5
3 2 11 9
4 1 2 10
In[100]: df2
Out[100]:
t
0 3
1 5
then write a function which would be used by pandas.apply() later:
In[101]: def cond_sum(x):
             # x is one row of df2; x.iloc[0] is its t value
             return df1.loc[(df1['a'] < x.iloc[0]) & (df1['b'] > x.iloc[0]), 'c'].sum()
finally:
In[102]: df3 = df2.apply(cond_sum,axis=1)
In[103]: df3
Out[103]:
0 13
1 18
dtype: int64
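The same conditional sum can also be sketched without apply by using NumPy broadcasting. This is an alternative to the answer above, not the answerer's method, and it builds a len(df2) by len(df1) boolean matrix, so it assumes both frames are small enough for that to fit in memory. The name inside is made up for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 5, 4, 2, 1],
                    'b': [10, 6, 6, 11, 2],
                    'c': [4, 5, 5, 9, 10]})
df2 = pd.DataFrame({'t': [3, 5]})

# Column vector of t values, so comparisons broadcast against every row of df1.
t = df2['t'].to_numpy()[:, None]

# inside[i, j] is True when a[j] < t[i] < b[j].
inside = (df1['a'].to_numpy() < t) & (df1['b'].to_numpy() > t)

# Sum c over the matching rows for each t.
df2['sum_c'] = (inside * df1['c'].to_numpy()).sum(axis=1)

print(df2)
```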

pandas compare and select the smallest number from another dataframe

I have two dataframes.
df1
Out[162]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
11 11 11 11
df2
Out[194]:
A B
0 a 3
1 b 4
2 c 5
I wish to create a third column in df2 that maps df2['A'] to the matching column of df1 and finds the smallest number in that column that is greater than the number in df2['B']. For example, for df2['C'].loc[0], it should go to df1['a'] and search for the smallest number that's greater than df2['B'].loc[0], which should be 4.
I had something like df2['C'] = df2['A'].map( df1[df1 > df2['B']].min() ). But this doesn't work, as it doesn't compare each column of df1 against the corresponding row of df2['B']. Thanks.
Use apply for row-wise methods:
In [54]:
# create our data
import pandas as pd
df1 = pd.DataFrame({'a':list(range(12)), 'b':list(range(12)), 'c':list(range(12))})
df1
Out[54]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
11 11 11 11
[12 rows x 3 columns]
In [68]:
# create our 2nd dataframe, note I have deliberately used alternate values for column 'B'
df2 = pd.DataFrame({'A':list('abc'), 'B':[3,5,7]})
df2
Out[68]:
A B
0 a 3
1 b 5
2 c 7
[3 rows x 2 columns]
In [69]:
# apply row-wise function, must use axis=1 for row-wise
df2['C'] = df2.apply(lambda row: df1.loc[df1[row['A']] > row['B'], row['A']].min(), axis=1)
df2
Out[69]:
A B C
0 a 3 4
1 b 5 6
2 c 7 8
[3 rows x 3 columns]
There is some example usage in the pandas docs
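The row-wise lookup can also be sketched as a small helper plus a list comprehension, which some may find easier to read than the lambda. The helper name smallest_greater is hypothetical and not part of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({'a': range(12), 'b': range(12), 'c': range(12)})
df2 = pd.DataFrame({'A': list('abc'), 'B': [3, 5, 7]})


def smallest_greater(col_name, threshold):
    """Hypothetical helper: smallest value in df1[col_name] above threshold."""
    values = df1[col_name]
    # An empty selection yields NaN, so thresholds above the maximum do not raise.
    return values[values > threshold].min()


# Pair each column name in A with its threshold in B.
df2['C'] = [smallest_greater(name, limit) for name, limit in zip(df2['A'], df2['B'])]

print(df2)
```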

python pandas groupby() result

I have the following python pandas data frame:
df = pd.DataFrame( {
'A': [1,1,1,1,2,2,2,3,3,4,4,4],
'B': [5,5,6,7,5,6,6,7,7,6,7,7],
'C': [1,1,1,1,1,1,1,1,1,1,1,1]
} );
df
A B C
0 1 5 1
1 1 5 1
2 1 6 1
3 1 7 1
4 2 5 1
5 2 6 1
6 2 6 1
7 3 7 1
8 3 7 1
9 4 6 1
10 4 7 1
11 4 7 1
I would like to have another column storing a value of a sum over C values for fixed (both) A and B. That is, something like:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
I have tried with pandas groupby and it kind of works:
res = {}
for a, group_by_A in df.groupby('A'):
    group_by_B = group_by_A.groupby('B', as_index = False)
    res[a] = group_by_B['C'].sum()
but I don't know how to get the results from res back into df in an orderly fashion. I would be very happy with any advice on this. Thank you.
Here's one way (though it feels this should work in one go with an apply, I can't get it).
In [11]: g = df.groupby(['A', 'B'])
In [12]: df1 = df.set_index(['A', 'B'])
The size groupby function is the one you want, we have to match it to the 'A' and 'B' as the index:
In [13]: df1['D'] = g.size() # unfortunately this doesn't play nice with as_index=False
# Same would work with g['C'].sum()
In [14]: df1.reset_index()
Out[14]:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
You could also do a one liner using transform applied to the groupby:
df['D'] = df.groupby(['A','B'])['C'].transform('sum')
You could also do a one liner using merge as follows:
df = df.merge(pd.DataFrame({'D':df.groupby(['A', 'B'])['C'].size()}), left_on=['A', 'B'], right_index=True)
You can also use this method:
columns = ['col1','col2',...]
df.groupby('col')[columns].sum()
If you want, you can chain .sort_values(by = 'colx', ascending = True/False) after .sum() to sort the final output by a specific column (colx) in ascending or descending order.
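As a quick check of the transform one-liner against the desired output in the question, here is a minimal self-contained sketch. The column name D_size is made up for this example; it shows that counting rows with 'size' gives the same result here only because every value of C is 1.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    'B': [5, 5, 6, 7, 5, 6, 6, 7, 7, 6, 7, 7],
    'C': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
})

# transform returns one value per original row, aligned to df's index,
# so the result can be assigned straight back as a new column.
df['D'] = df.groupby(['A', 'B'])['C'].transform('sum')

# Group size per row; equal to the sum only because C is all ones.
df['D_size'] = df.groupby(['A', 'B'])['C'].transform('size')

print(df)
```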
