Pandas DataFrame: resampling along integer index / grouping by groups of n elements - python

I know about pandas resampling functions using a DateTimeIndex.
But how can I easily resample/group along an integer index?
The following code illustrates the problem and works:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list('AB'))
print(df)
   A  B
0  3  2
1  1  1
2  0  1
3  2  3
4  2  0
5  4  0
6  3  1
7  3  4
8  0  2
9  4  4
# sum of n consecutive elements
n = 3
tuples = [(i, i+n-1) for i in range(0, len(df.index), n)]
df_new = pd.concat([df.loc[i[0]:i[1]].sum() for i in tuples], axis=1).T
print(df_new)
   A  B
0  4  4
1  8  3
2  6  7
3  4  4
But isn't there a more elegant way to accomplish this?
The code seems a bit heavy-handed to me.
Thanks in advance!

You can floor-divide the index and aggregate with some function:
df1 = df.groupby(df.index // n).sum()
If the index is not the default one (a unique integer range), aggregate by a floor-divided numpy.arange created from the length of the DataFrame:
df1 = df.groupby(np.arange(len(df)) // n).sum()
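The same floor-divided key accepts any aggregation, so you can get a resample-like summary by swapping or combining functions. A minimal sketch, assuming the same df and n as in the question:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list('AB'))
n = 3

# group every n consecutive rows, then compute several statistics at once
summary = df.groupby(np.arange(len(df)) // n).agg(['sum', 'mean', 'max'])
print(summary)  # one column level per statistic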

You can use groupby on the integer division of the index by n, i.e.
df.groupby(lambda i: i//n).sum()
Here is the code:
import numpy as np
import pandas as pd
n=3
df = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list('AB'))
print('df:')
print(df)
res = df.groupby(lambda i: i//n).sum()
print('using groupby:')
print(res)
tuples = [(i, i+n-1) for i in range(0, len(df.index), n)]
df_new = pd.concat([df.loc[i[0]:i[1]].sum() for i in tuples], axis=1).T
print('using your method:')
print(df_new)
and the output
df:
   A  B
0  1  0
1  3  0
2  1  1
3  0  4
4  3  4
5  0  1
6  0  4
7  4  0
8  0  2
9  2  2
using groupby:
   A  B
0  5  1
1  3  9
2  4  6
3  2  2
using your method:
   A  B
0  5  1
1  3  9
2  4  6
3  2  2

Related

Comparing the value of a column with the previous value of a new column using Apply in Python (Pandas)

I have a dataframe with these values in column A:
df = pd.DataFrame(A, columns=['A'])
   A
0  0
1  5
2  1
3  7
4  0
5  2
6  1
7  3
8  0
I need to create a new column (called B) and populate it using the following conditions:
Condition 1: If the value of A is equal to 0 then, the value of B must be 0.
Condition 2: If the value of A is not 0 then I compare its value to the previous value of B. If A is higher than the previous value of B then I take A, otherwise I take B.
The result should be this:
   A  B
0  0  0
1  5  5
2  1  5
3  7  7
4  0  0
5  2  2
6  1  2
7  3  3
The dataset is huge, so using loops would be too slow. I need to solve this without loops and without the pandas "loc" function. Could anyone help me solve this using the apply function? I have tried different things without success.
Thanks a lot.
I guess one way to do this could be the following:
def do_your_stuff(row):
    global value
    # fancy stuff here
    value = row["b"]
    [...]

value = df.iloc[0]['B']
df["C"] = df.apply(lambda row: do_your_stuff(row), axis=1)
Try this:
df['B'] = df['A'].shift().fillna(0)
df['B'] = df.apply(lambda x: 0 if x.A == 0 else (x.A if x.A > x.B else x.B), axis=1)
Use .shift() to shift the column one cell down, and check whether the previous value is greater and the current value is not 0. Then use .mask() to replace the values with the previous one where the condition holds.
from io import StringIO
import pandas as pd
wt = StringIO("""A
0  0
1  2
2  3
3  1
4  2
5  7
6  0
""")
df = pd.read_csv(wt, sep=r'\s\s+', engine='python')
df
   A
0  0
1  2
2  3
3  1
4  2
5  7
6  0
def func(df, col):
    df['B'] = df[col].mask(cond=((df[col].shift(1) > df[col]) & (df[col] != 0)),
                           other=df[col].shift(1))
    if col == 'B':
        while ((df[col].shift(1) > df[col]) & (df[col] != 0)).any():
            df['B'] = df[col].mask(cond=((df[col].shift(1) > df[col]) & (df[col] != 0)),
                                   other=df[col].shift(1))
    return df

(df.pipe(func, 'A').pipe(func, 'B'))
Output:
   A  B
0  0  0
1  2  2
2  3  3
3  1  3
4  2  3
5  7  7
6  0  0
Using the solution of Achille, I solved it this way:
import pandas as pd

A = [0, 2, 3, 0, 2, 7, 2, 3, 2, 20, 1, 0, 2, 5, 4, 3, 1]
df = pd.DataFrame(A, columns=['A'])
df['B'] = 0

def function(row):
    global value
    global prev
    if row['A'] == 0:
        value = 0
    elif row['A'] > value:
        value = row['A']
    else:
        value = prev
    prev = value
    return value

value = df.iloc[0]['B']
prev = value
df["B"] = df.apply(lambda row: function(row), axis=1)
df
output:
     A   B
0    0   0
1    2   2
2    3   3
3    0   0
4    2   2
5    7   7
6    2   7
7    3   7
8    2   7
9   20  20
10   1  20
11   0   0
12   2   2
13   5   5
14   4   5
15   3   5
16   1   5
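For reference, the two conditions amount to a cumulative maximum of A that resets at every zero, so the column can also be built without apply at all. A sketch of that idea (my own variant, assuming I have read the conditions correctly):
import pandas as pd

A = [0, 2, 3, 0, 2, 7, 2, 3, 2, 20, 1, 0, 2, 5, 4, 3, 1]
df = pd.DataFrame(A, columns=['A'])

# every zero in A opens a new segment; within a segment, B is the running max of A
segments = df['A'].eq(0).cumsum()
df['B'] = df.groupby(segments)['A'].cummax()
This reproduces the output above and avoids the global state.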

Matching two columns from Pandas Dataframe but the order matters

I have two DataFrames
df_1:
idx  A  X
0    1  A
1    2  B
2    3  C
3    4  D
4    1  E
5    2  F
and
df_2:
idx  B  Y
0    1  H
1    2  I
2    4  J
3    2  K
4    3  L
5    1  M
my goal is to get the following:
df_result:
idx  A  X  B  Y
0    1  A  1  H
1    2  B  2  I
2    4  D  4  J
3    2  F  2  K
I am trying to match the A and B columns, based on the column B from df_2.
Columns A and B repeat their content after getting to 4. The order matters here and because of that the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = (pd.merge_asof(left=df_1, right=df_2, left_on='idx', right_on='idx',
                           left_by='A', right_by='B', direction='backward', tolerance=2)
             .dropna()
             .drop(labels='idx', axis='columns')
             .reset_index(drop=True))
Gets me what I want.
IIUC this should work:
df_result = df_1.merge(df_2, left_on=['idx', 'A'], right_on=['idx', 'B'])

(Python, DataFrame): Add a Column and insert the nth smallest value in the row

How do I find the nth smallest number in a row, within a DataFrame, and add that value as an entry in a new column (because I would ultimately like to export the data).
Example Data
Setup
np.random.seed([3,14159])
df = pd.DataFrame(np.random.randint(10, size=(4, 5)), columns=list('ABCDE'))
   A  B  C  D  E
0  4  8  1  1  9
1  2  8  1  4  2
2  8  2  8  4  9
3  4  3  4  1  5
In all of the following solutions, I assume n = 3
Solution 1
function prt below
Use np.partition to place the smallest values to the left of a partition and the largest to the right. Then take everything to the left of the partition and find its max.
df.assign(nth=np.partition(df.values, 3, axis=1)[:, :3].max(1))
   A  B  C  D  E  nth
0  4  8  1  1  9    4
1  2  8  1  4  2    2
2  8  2  8  4  9    8
3  4  3  4  1  5    4
Solution 2
function srt below
More intuitive, but with a costlier time complexity, using np.sort
df.assign(nth=np.sort(df.values, axis=1)[:, 2])
   A  B  C  D  E  nth
0  4  8  1  1  9    4
1  2  8  1  4  2    2
2  8  2  8  4  9    8
3  4  3  4  1  5    4
Solution 3
function rnk below
Using pd.DataFrame.rank
A concise version, though it upcasts to float
df.assign(nth=df.where(df.rank(1, method='first').eq(3)).stack().values)
   A  B  C  D  E  nth
0  4  8  1  1  9  4.0
1  2  8  1  4  2  2.0
2  8  2  8  4  9  8.0
3  4  3  4  1  5  4.0
Solution 4
function whr below
Using np.where and pd.DataFrame.rank
i, j = np.where(df.rank(1, method='first') == 3)
df.assign(nth=df.values[i, j])
   A  B  C  D  E  nth
0  4  8  1  1  9    4
1  2  8  1  4  2    2
2  8  2  8  4  9    8
3  4  3  4  1  5    4
Timing
Notice that srt is quickest at first and stays comparable to prt for a while; then, for larger numbers of columns, the more efficient algorithm behind prt kicks in.
res.plot(loglog=True)
from timeit import timeit

prt = lambda df, n: df.assign(nth=np.partition(df.values, n, axis=1)[:, :n].max(1))
srt = lambda df, n: df.assign(nth=np.sort(df.values, axis=1)[:, n - 1])
rnk = lambda df, n: df.assign(nth=df.where(df.rank(1, method='first').eq(n)).stack().values)

def whr(df, n):
    i, j = np.where(df.rank(1, method='first').values == n)
    return df.assign(nth=df.values[i, j])

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000],
    columns='prt srt rnk whr'.split(),
    dtype=float
)

for i in res.index:
    num_rows = int(np.log(i))
    d = pd.DataFrame(np.random.rand(num_rows, i))
    for j in res.columns:
        stmt = '{}(d, 3)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)
You can do this as follows (note that nth here is the zero-based position of the value you want, so the 3rd smallest corresponds to nth = 2):
df.assign(nth=df.apply(lambda x: np.partition(x, nth)[nth], axis='columns'))
Example:
In[72]: df = pd.DataFrame(np.random.rand(3, 3), index=list('abc'), columns=[1, 2, 3])
In[73]: df
Out[73]:
          1         2         3
a  0.436730  0.653242  0.843014
b  0.643496  0.854859  0.531652
c  0.831672  0.575336  0.517944
In[74]: df.assign(nth=df.apply(lambda x: np.partition(x, 1)[1], axis='columns'))
Out[74]:
          1         2         3       nth
a  0.436730  0.653242  0.843014  0.653242
b  0.643496  0.854859  0.531652  0.643496
c  0.831672  0.575336  0.517944  0.575336
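Since np.partition's kth argument is zero-based, a thin wrapper (the helper name is my own) makes a one-based "nth smallest" explicit. A sketch:
import numpy as np
import pandas as pd

def nth_smallest_per_row(df, n):
    # np.partition's kth is zero-based, hence n - 1
    return df.apply(lambda row: np.partition(row.values, n - 1)[n - 1], axis='columns')

df = pd.DataFrame(np.random.rand(3, 3), columns=list('abc'))
df.assign(nth=nth_smallest_per_row(df, 2))  # 2nd smallest value of each row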
Here is a method that finds the nth smallest item in a list:
def find_nth_in_list(items, n):
    return sorted(items)[n-1]
The usage:
items = [10, 5, 7, 9, 8, 4, 6, 2, 1, 3]
print(find_nth_in_list(items, 2))
Output:
2
You can give the row items as a list to this function.
EDIT
You can find rows with this function:
# Returns all rows as a list
def find_rows(df):
    rows = []
    for row in df.iterrows():
        index, data = row
        rows.append(data.tolist())
    return rows
Example usage:
rows = find_rows(df)  # all rows as a list
smallest_3rd = find_nth_in_list(rows[2], 3)  # 3rd row, 3rd smallest item
Generate some random data:
dd = pd.DataFrame(data=np.random.rand(7, 3))
Find the minimum value per row using numpy:
dd['minPerRow'] = dd.apply(np.min, axis=1)
Export the results:
dd['minPerRow'].to_csv('file.csv')

Get two return values from Pandas apply

I'm trying to return two different values from an apply method but I can't figure out how to get the results I need.
With a function as:
def fun(row):
    s = [sum(row[i:i+2]) for i in range(len(row) - 1)]
    ps = s.index(max(s))
    return max(s), ps
and df as:
   6:00  6:15  6:30
0     3     8     9
1    60    62   116
I'm trying to return the max value of the row, but I also need to get the index of the first value that produces the max combination.
df["phour"] = df.apply(fun, axis=1)
I can get the output I need, but I don't know how to get the index into a new column. So far I'm getting both answers in a tuple:
   6:00  6:15  6:30     phour
0     3     8     9   (17, 1)
1    60    62   116  (178, 1)
How can I get the index value in its own column?
You can get the index in a separate column like this:
df[['phour','index']] = df.apply(lambda row: pd.Series(list(fun(row))), axis=1)
Or if you modify fun slightly:
def fun(row):
    s = [sum(row[i:i+2]) for i in range(len(row) - 1)]
    ps = s.index(max(s))
    return [max(s), ps]
Then the code becomes a little less convoluted:
df[['phour','index']] = df.apply(lambda row: pd.Series(fun(row)), axis=1)
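On pandas 0.23 or newer, you can also let apply expand list-like results into columns directly, which avoids building a pd.Series in the lambda:
# result_type='expand' spreads each returned list across new columns
df[['phour', 'index']] = df.apply(fun, axis=1, result_type='expand')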
You can apply pd.Series:
df.drop('Double', axis=1).join(df.Double.apply(pd.Series, index=['D1', 'D2']))
   A  B  C  D1  D2
0  1  2  3   1   2
1  2  3  2   3   4
2  3  4  4   5   6
3  4  1  1   7   8
Equivalently
df.drop('Double', axis=1).join(
    pd.DataFrame(np.array(df.Double.values.tolist()), columns=['D1', 'D2'])
)
Setup, using @GordonBean's df:
df = pd.DataFrame({'A':[1,2,3,4], 'B':[2,3,4,1], 'C':[3,2,4,1], 'Double': [(1,2), (3,4), (5,6), (7,8)]})
If you are just trying to get the max and argmax, I recommend using the pandas API:
DataFrame.idxmax
So:
df = pd.DataFrame({'A':[1,2,3,4], 'B':[2,3,4,1], 'C':[3,2,4,1]})
df
   A  B  C
0  1  2  3
1  2  3  2
2  3  4  4
3  4  1  1
df['Max'] = df.max(axis=1)
df['ArgMax'] = df.idxmax(axis=1)
df
   A  B  C  Max ArgMax
0  1  2  3    3      C
1  2  3  2    3      B
2  3  4  4    4      B
3  4  1  1    4      A
Update:
And if you need the actual index value, you can use numpy.ndarray.argmax:
df['ArgMaxNum'] = df[['A','B','C']].values.argmax(axis=1)
   A  B  C  Max ArgMax  ArgMaxNum
0  1  2  3    3      C          2
1  2  3  2    3      B          1
2  3  4  4    4      B          1
3  4  1  1    4      A          0
One way to split out the tuples into separate columns could be with tuple unpacking:
df = pd.DataFrame({'A':[1,2,3,4], 'B':[2,3,4,1], 'C':[3,2,4,1], 'Double': [(1,2), (3,4), (5,6), (7,8)]})
df
   A  B  C  Double
0  1  2  3  (1, 2)
1  2  3  2  (3, 4)
2  3  4  4  (5, 6)
3  4  1  1  (7, 8)
df['D1'] = [d[0] for d in df.Double]
df['D2'] = [d[1] for d in df.Double]
df
   A  B  C  Double  D1  D2
0  1  2  3  (1, 2)   1   2
1  2  3  2  (3, 4)   3   4
2  3  4  4  (5, 6)   5   6
3  4  1  1  (7, 8)   7   8
There's got to be a better way, but you can do:
df.merge(pd.DataFrame(((i, j) for i, j in df.apply(lambda x: fun(x), axis=1).values),
                      columns=['phour', 'index']),
         left_index=True, right_index=True)

Vectorized calculation of a column's value based on a previous value of the same column?

I have a pandas dataframe with two columns A,B as below.
I want a vectorized solution for creating a new column C where C[i] = C[i-1] - A[i] + B[i].
df = pd.DataFrame(data={'A': [10, 2, 3, 4, 5, 6], 'B': [0, 1, 2, 3, 4, 5]})
>>> df
    A  B
0  10  0
1   2  1
2   3  2
3   4  3
4   5  4
5   6  5
Here is the solution using for-loops:
df['C'] = df['A']
for i in range(1, len(df)):
    df['C'][i] = df['C'][i-1] - df['A'][i] + df['B'][i]
>>> df
    A  B   C
0  10  0  10
1   2  1   9
2   3  2   8
3   4  3   7
4   5  4   6
5   6  5   5
... which does the job.
But since loops are slow in comparison to vectorized calculations, I want a vectorized solution for this in pandas:
I tried to use the shift() method like this:
df['C'] = df['C'].shift(1).fillna(df['A']) - df['A'] + df['B']
but it didn't help since the shifted C column isn't updated with the calculation. It keeps its original values:
>>> df['C'].shift(1).fillna(df['A'])
0    10
1    10
2     2
3     3
4     4
5     5
and that produces a wrong result.
This can be vectorized, since delta[i] = C[i] - C[i-1] = -A[i] + B[i]. You can get delta from A and B first, then calculate the cumulative sum of delta (plus C[0]) to get the full C.
Code as follows:
delta = df['B'] - df['A']
delta[0] = 0
df['C'] = df.loc[0, 'A'] + delta.cumsum()

print(df)
    A  B   C
0  10  0  10
1   2  1   9
2   3  2   8
3   4  3   7
4   5  4   6
5   6  5   5
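As a quick sanity check, the cumulative-sum result can be compared against the loop version from the question. A small sketch using the same df:
import pandas as pd

df = pd.DataFrame(data={'A': [10, 2, 3, 4, 5, 6], 'B': [0, 1, 2, 3, 4, 5]})

# vectorized: C[0] = A[0], then C[i] = C[i-1] - A[i] + B[i]
delta = df['B'] - df['A']
delta[0] = 0
df['C'] = df.loc[0, 'A'] + delta.cumsum()

# loop-based reference
ref = [df['A'][0]]
for i in range(1, len(df)):
    ref.append(ref[-1] - df['A'][i] + df['B'][i])

assert df['C'].tolist() == ref  # both give [10, 9, 8, 7, 6, 5]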
