Summing rows based on cumsum values - python

I have a data frame like
index  A B C
0      4 7 9
1      2 6 2
2      6 9 1
3      7 2 4
4      8 5 6
I want to create another data frame from this, based on the sum of the C column. The catch is that once the running sum of C reaches 10 or higher, a new row should start. Something like this.
index  A B C
0   6 13 11
1   21 16 11
Any help will be highly appreciated. Is there a robust way to do this, or is iterating my last resort?

There is a non-iterative approach. You'll need a groupby key built from the cumulative sum of C modulo 10.
# Groupby logic - https://stackoverflow.com/a/45959831/4909087
out = df.groupby((df.C.cumsum() % 10).diff().shift().lt(0).cumsum(), as_index=0).agg('sum')
print(out)
    A   B   C
0   6  13  11
1  21  16  11
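To see why this works, the grouper can be unpacked step by step (a sketch over the question's data; grouping by `key.to_numpy()` here is a minor variation to sidestep the key's inherited column name):

```python
import pandas as pd

df = pd.DataFrame({'A': [4, 2, 6, 7, 8],
                   'B': [7, 6, 9, 2, 5],
                   'C': [9, 2, 1, 4, 6]})

c = df.C.cumsum()                        # 9, 11, 12, 16, 22
m = c % 10                               # 9, 1, 2, 6, 2
# a drop in the modulo means the running total crossed a multiple of 10;
# shift() delays the break so the new group starts on the following row
key = m.diff().shift().lt(0).cumsum()    # 0, 0, 1, 1, 1
out = df.groupby(key.to_numpy()).sum().reset_index(drop=True)
print(out)
```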

The code would look something like this:
import pandas as pd

lista = [4, 7, 10, 11, 7]
listb = [7, 8, 2, 5, 9]
listc = [9, 2, 1, 4, 6]
df = pd.DataFrame({'A': lista, 'B': listb, 'C': listc})

def sumsc(df):
    suma = 0
    sumb = 0
    sumc = 0
    list_of_sums = []
    for i in range(len(df)):
        suma += df.iloc[i, 0]
        sumb += df.iloc[i, 1]
        sumc += df.iloc[i, 2]
        if sumc > 10:
            list_of_sums.append([suma, sumb, sumc])
            suma = 0
            sumb = 0
            sumc = 0
    return pd.DataFrame(list_of_sums)

sumsc(df)
0 1 2
0 11 15 11
1 28 16 11
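One caveat with the loop above: any rows accumulated after the last time the threshold is crossed are silently dropped. If that matters, a small variant (my addition, not part of the original answer; same `> 10` test as above) can flush the remainder after the loop:

```python
import pandas as pd

def sumsc_keep_tail(df, threshold=10):
    sums = [0, 0, 0]
    rows = []
    for i in range(len(df)):
        for j in range(3):
            sums[j] += df.iloc[i, j]
        if sums[2] > threshold:
            rows.append(sums)
            sums = [0, 0, 0]
    if sums[2] > 0:              # flush the partial group left at the end
        rows.append(sums)
    return pd.DataFrame(rows, columns=df.columns)

df = pd.DataFrame({'A': [4, 7, 10, 11, 7],
                   'B': [7, 8, 2, 5, 9],
                   'C': [9, 2, 1, 4, 6]})
print(sumsc_keep_tail(df))
```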


Find local maxima or peaks (indices) in a numeric series using numpy and pandas. A peak refers to a value surrounded by smaller values on both sides.

Write a python program to find all the local maxima or peaks (indices) in a numeric series using numpy and pandas. A peak refers to a value surrounded by smaller values on both sides.
Note
Create a Pandas series from the given input.
Input format:
The first line of the input consists of a list of integers separated by spaces, to form a pandas series.
Output format:
Display the array of indices where peak values are present.
Sample testcase
input1
12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15
output1
[2 5 10 12]
How to solve this problem?
import pandas as pd

a = "12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15"
a = [int(x) for x in a.split(" ")]

angles = []
for i in range(len(a)):
    if i != 0:
        if a[i] > a[i-1]:
            angles.append('rise')
        else:
            angles.append('fall')
    else:
        angles.append('ignore')

prev_val = "none"
counts = []
for s in angles:
    if s == "fall" and prev_val == "rise":
        prev_val = s
        counts.append(1)
    else:
        prev_val = s
        counts.append(0)

peaks_pd = pd.Series(counts).shift(-1).fillna(0).astype(int)
df = pd.DataFrame({
    'a': a,
    'peaks': peaks_pd
})
peak_vals = list(df[df['peaks'] == 1]['a'].index)
This could be improved further. The steps I have followed:
First, determine for each point whether the series is rising or falling.
Then mark the indices where it starts falling right after rising; those are the peaks.
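The same rise-then-fall rule can be expressed without the intermediate angle/count lists (a sketch; endpoints and plateaus are not counted as peaks, which matches the sample):

```python
a = [12, 1, 2, 1, 9, 10, 2, 5, 7, 8, 9, -9, 10, 5, 15]

# a peak is any interior point that rises from its left neighbour
# and falls to its right neighbour (strict comparisons)
peaks = [i for i in range(1, len(a) - 1)
         if a[i - 1] < a[i] > a[i + 1]]
print(peaks)  # [2, 5, 10, 12]
```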
Use:
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema

data = [12, 1, 2, 1.1, 9, 10, 2.1, 5, 7, 8, 9.1, -9, 10.1, 5.1, 15]
s = pd.Series(data)
n = 3  # number of points to be checked before and after

local_max_index = argrelextrema(s.to_frame().to_numpy(), np.greater_equal, order=n)[0].tolist()
print (local_max_index)
[0, 5, 14]

local_max_index = s.index[(s.shift() <= s) & (s.shift(-1) <= s)].tolist()
print (local_max_index)
[2, 5, 10, 12]

local_max_index = s.index[s == s.rolling(n, center=True).max()].tolist()
print (local_max_index)
[2, 5, 10, 12]
EDIT: Solution for processing value in DataFrame:
df = pd.DataFrame({'Input': ["12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15"]})
print (df)
Input
0 12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15
s = df['Input'].iloc[[0]].str.split().explode().astype(int).reset_index(drop=True)
print (s)
0 12
1 1
2 2
3 1
4 9
5 10
6 2
7 5
8 7
9 8
10 9
11 -9
12 10
13 5
14 15
Name: Input, dtype: int32
local_max_index = s.index[(s.shift() <= s) & (s.shift(-1) <= s)].tolist()
print (local_max_index)
[2, 5, 10, 12]
df['output'] = [local_max_index]
print (df)
Input output
0 12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15 [2, 5, 10, 12]

Collapse multiple timestamp rows into a single one

I have a series like that:
s = pd.DataFrame({'ts': [1, 2, 3, 6, 7, 11, 12, 13]})
s
ts
0 1
1 2
2 3
3 6
4 7
5 11
6 12
7 13
I would like to collapse rows that have difference less than MAX_DIFF (2). That means that the desired output must be:
[{'ts_from': 1, 'ts_to': 3},
{'ts_from': 6, 'ts_to': 7},
{'ts_from': 11, 'ts_to': 13}]
I did some coding:
s['close'] = s.diff().shift(-1)
s['close'] = s[s['close'] > MAX_DIFF].astype('bool')
s['close'].iloc[-1] = True
parts = []
ts_from = None
for _, row in s.iterrows():
    if row['close'] is True:
        part = {'ts_from': ts_from, 'ts_to': row['ts']}
        parts.append(part)
        ts_from = None
        continue
    if not ts_from:
        ts_from = row['ts']
This works but does not seem optimal because of iterrows(). I thought about ranks but couldn't figure out how to compute a rank I could then group by.
Is there a way to optimize the algorithm?
You can create groups by checking where the difference is more than your threshold and take a cumsum. Then agg however you'd like, perhaps first and last in this case.
gp = s['ts'].diff().abs().ge(2).cumsum().rename(None)
res = s.groupby(gp).agg(ts_from=('ts', 'first'),
                        ts_to=('ts', 'last'))
# ts_from ts_to
#0 1 3
#1 6 7
#2 11 13
And if you want the list of dicts then:
res.to_dict('records')
#[{'ts_from': 1, 'ts_to': 3},
# {'ts_from': 6, 'ts_to': 7},
# {'ts_from': 11, 'ts_to': 13}]
For completeness here is how the grouper aligns with the DataFrame:
s['gp'] = gp
print(s)
ts gp
0 1 0 # `1` becomes ts_from for group 0
1 2 0
2 3 0 # `3` becomes ts_to for group 0
3 6 1 # `6` becomes ts_from for group 1
4 7 1 # `7` becomes ts_to for group 1
5 11 2 # `11` becomes ts_from for group 2
6 12 2
7 13 2 # `13` becomes ts_to for group 2

Passing a list of values to remap in Python

I was wondering if there is a way to pass a list of values and corresponding new values to remap in my dataframe. I am using a truncated version of my dataset and would rather not have to update the code one entry at a time; there are about 30 different unique QFundMaster values in my data.
QFundMaster NPS
0 3 1
1 5 2
2 10 3
3 23 9
4 26 1
The code I am using to remap the data is as follows:
df['Fund'] = df['QFundMaster'].map({3: 'Fund1',
                                    5: 'Fund2',
                                    10: 'Fund3',
                                    23: 'Fund4',
                                    26: 'Fund5'})
The code works perfectly fine, but I was after a way to pass a list of old values and new values so I don't have to edit the mapping one entry at a time, and to make it more efficient. Any help in the right direction would be appreciated. Thanks!
print(df.Fund)
0 Fund1
1 Fund2
2 Fund3
3 Fund4
4 Fund5
Name: Fund, dtype: object
If you have the list of old_values and new_values, you could do:
import pandas as pd
data = [[3, 1],
[5, 2],
[10, 3],
[23, 9],
[26, 1]]
df = pd.DataFrame(data=data, columns=['QFundMaster', 'NPS'])
old_values = [3, 5, 10, 23, 26]
new_values = ['Fund1', 'Fund2', 'Fund3', 'Fund4', 'Fund5']
df['Fund'] = df['QFundMaster'].map(dict(zip(old_values, new_values)))
print(df)
Output
QFundMaster NPS Fund
0 3 1 Fund1
1 5 2 Fund2
2 10 3 Fund3
3 23 9 Fund4
4 26 1 Fund5
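With ~30 distinct codes, the mapping itself can also be generated rather than typed out. Assuming the fund names really are just `Fund1`, `Fund2`, ... assigned in sorted order of the codes (an assumption on my part), a sketch:

```python
import pandas as pd

df = pd.DataFrame({'QFundMaster': [3, 5, 10, 23, 26],
                   'NPS': [1, 2, 3, 9, 1]})

# build {code: 'FundN'} automatically from the distinct codes
# (assumes 'FundN' naming in sorted code order)
codes = sorted(df['QFundMaster'].unique())
mapping = {code: f'Fund{i}' for i, code in enumerate(codes, start=1)}
df['Fund'] = df['QFundMaster'].map(mapping)
print(df)
```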

Python Pandas: Subsetting data frame both by rows and columns?

Data frame has w (week) and y (year) columns.
d = {
'y': [11,11,13,15,15],
'w': [5, 4, 7, 7, 8],
'z': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(d)
In [61]: df
Out[61]:
w y z
0 5 11 1
1 4 11 2
2 7 13 3
3 7 15 4
4 8 15 5
Two questions:
1) How do I get the min/max date from this data frame as two numbers w and y in a list [w, y]?
2) How do I subset both columns and rows, so that all w and y in the resulting data frame satisfy the conditions:
11 <= y <= 15
4 <= w <= 7
To get min/max pairs I need functions:
min_pair() --> [11,4]
max_pair() --> [15,8]
and these to get a data frame subset:
from_to(y1,w1,y2,w2)
from_to(11,4,15,7) -->
should return rf data frame like this:
r = {
'y': [11,13,15],
'w': [4, 7, 7 ],
'z': [2, 3, 4 ]
}
rf = pd.DataFrame(r)
In [62]: rf
Out[62]:
w y z
0 4 11 2
1 7 13 3
2 7 15 4
Are there any standard functions for this?
Update
For subsetting the following worked for me:
df[(df.y <= 15 ) & (df.y >= 11) & (df.w >= 4) & (df.w <= 7)]
a lot of typing though ...
Here are a couple of methods (run here on just the w and y columns, hence the two-element lists):
In [176]: df.min().tolist()
Out[176]: [4, 11]
In [177]: df.max().tolist()
Out[177]: [8, 15]
In [178]: df.query('11 <= y <= 15 and 4 <= w <= 7')
Out[178]:
w y
0 5 11
1 4 11
2 7 13
3 7 15
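The question's helper functions can be wrapped around these one-liners (a sketch; note that min/max are taken per column, so the pair need not come from a single row, and the range filter is applied to each column independently, matching the `query` output above):

```python
import pandas as pd

d = {'y': [11, 11, 13, 15, 15],
     'w': [5, 4, 7, 7, 8],
     'z': [1, 2, 3, 4, 5]}
df = pd.DataFrame(d)

def min_pair(df):
    return [df['y'].min(), df['w'].min()]   # per-column minima

def max_pair(df):
    return [df['y'].max(), df['w'].max()]   # per-column maxima

def from_to(df, y1, w1, y2, w2):
    # @name references local variables inside query()
    return df.query('@y1 <= y <= @y2 and @w1 <= w <= @w2')

print(min_pair(df), max_pair(df))
print(from_to(df, 11, 4, 15, 7))
```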

Printing all solutions in the shape of a matrix using \n\

This function prints all the products i*l for i and l from 1 to d. I want to print the output in the shape of a d×d matrix.
def example(d):
    for i in range(1, d+1):
        for l in range(1, d+1):
            print(i*l)
For d = 5, the expected output should look like:
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25
You could add the values in your second for loop to a list, join the list, and finally print it.
def mul(d):
    for i in range(1, d+1):
        list_to_print = []
        for l in range(1, d+1):
            list_to_print.append(str(l*i))
        print(" ".join(list_to_print))
>>> mul(5)
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25
If you want it to be printed in aligned rows and columns, have a read at Pretty print 2D Python list.
EDIT
The above example will work for both Python 3 and Python 2. However, for Python 3 (as #richard has put in the comments), you can use:
def mul(d):
    for i in range(1, d+1):
        for l in range(1, d+1):
            print(i*l, end=" ")
        print()
>>> mul(5)
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25
Try this:
mm = []
def mul(d):
    for i in range(1, d+1):
        ll = []
        for l in range(1, d+1):
            ll.append(i*l)
        mm.append(ll)

mul(5)
for x in mm:
    print(x)
[1, 2, 3, 4, 5]
[2, 4, 6, 8, 10]
[3, 6, 9, 12, 15]
[4, 8, 12, 16, 20]
[5, 10, 15, 20, 25]
