Only sum consecutive pandas rows when the column has consecutive numbers - python

I have a dataframe like
pd.DataFrame({'i': [3, 4, 12, 25, 44, 45, 52, 53, 65, 66],
              't': range(1, 11),
              'v': range(0, 100, 10)})
i.e.
i t v
0 3 1 0
1 4 2 10
2 12 3 20
3 25 4 30
4 44 5 40
5 45 6 50
6 52 7 60
7 53 8 70
8 65 9 80
9 66 10 90
I would like to sum the value in column v with the next row's value whenever i increases by 1 between the rows; otherwise leave the row unchanged.
One can assume that there are at most two consecutive rows to sum; thus the last row might be ambiguous, depending on whether it is summed or not.
The resulting dataframe should look like:
i t v
0 3 1 10
2 12 3 20
3 25 4 30
4 44 5 90
6 52 7 130
8 65 9 170
Obviously I could loop over the dataframe using .iterrows(), but there must be a smarter solution.
I tried various combinations of shift, diff and groupby, but I cannot see a way to do it...

It's a common technique to identify the blocks with a cumsum on diff: a new block starts wherever i does not increase by exactly 1.
blocks = df['i'].diff().ne(1).cumsum()
df.groupby(blocks, as_index=False).agg({'i':'first','t':'first', 'v':'sum'})
Output:
i t v
0 3 1 10
1 12 3 20
2 25 4 30
3 44 5 90
4 52 7 130
5 65 9 170
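To see how the block labels are built, here is the intermediate blocks series for the sample data:

```python
import pandas as pd

df = pd.DataFrame({'i': [3, 4, 12, 25, 44, 45, 52, 53, 65, 66],
                   't': range(1, 11),
                   'v': range(0, 100, 10)})

# diff() is NaN for the first row and 1 wherever 'i' is consecutive;
# ne(1) marks block starts (True), and cumsum() numbers the blocks
blocks = df['i'].diff().ne(1).cumsum()
print(blocks.tolist())  # [1, 1, 2, 3, 4, 4, 5, 5, 6, 6]
```

Rows sharing a label form one group, so 'first'/'sum' aggregation collapses each consecutive pair.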

Let us try:
out = df.groupby(df['i'].diff().ne(1).cumsum()).agg({'i': 'first', 't': 'first', 'v': 'sum'})
Out[11]:
i t v
i
1 3 1 10
2 12 3 20
3 25 4 30
4 44 5 90
5 52 7 130
6 65 9 170

Related

Find closest element in list for each row in Pandas DataFrame column

I have a Pandas DataFrame and a comparison list like this:
In [21]: df
Out[21]:
Results
0 90
1 80
2 70
3 60
4 50
5 40
6 30
7 20
8 10
In [23]: comparation_list
Out[23]: [83, 72, 65, 40, 36, 22, 15, 12]
Now, I want to create a new column on this df where the value of each row is the element of the comparison list closest to the corresponding row of the Results column.
The output should be something like this:
Results assigned_value
0 90 83
1 80 83
2 70 72
3 60 65
4 50 40
5 40 40
6 30 36
7 20 22
8 10 12
Doing this through loops or using apply comes straight to my mind, but I would like to know how to do it in a vectorized way.
Use a merge_asof:
out = pd.merge_asof(
    df.reset_index().sort_values(by='Results'),
    pd.Series(sorted(comparation_list), name='assigned_value'),
    left_on='Results', right_on='assigned_value',
    direction='nearest'
).set_index('index').sort_index()
Output:
Results assigned_value
index
0 90 83
1 80 83
2 70 72
3 60 65
4 50 40
5 40 40
6 30 36
7 20 22
8 10 12
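Another fully vectorized option (a sketch; fine for short lists, since it builds an n×m distance matrix) uses NumPy broadcasting with argmin:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Results': [90, 80, 70, 60, 50, 40, 30, 20, 10]})
comparation_list = [83, 72, 65, 40, 36, 22, 15, 12]

arr = np.asarray(comparation_list)
# |Results - candidate| for every pair, then pick the candidate with the smallest gap
idx = np.abs(df['Results'].to_numpy()[:, None] - arr).argmin(axis=1)
df['assigned_value'] = arr[idx]
print(df['assigned_value'].tolist())  # [83, 83, 72, 65, 40, 40, 36, 22, 12]
```

Unlike merge_asof, this needs no sorting, at the cost of O(n·m) memory.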

Deciles in Python

I want to group a column into deciles and assign points out of 50.
The lowest decile receives 5 points, and points increase in 5-point increments.
With the code below I am able to group my column into deciles. How do I assign points so that the lowest decile has 5 points, the 2nd lowest has 10 points, and so on, with the highest decile getting 50 points?
df = pd.DataFrame({'column': [1, 2, 2, 3, 4, 4, 5, 6, 6, 7, 7, 8, 8, 9, 10, 10, 10, 12, 13, 14, 16, 16, 16, 18, 19, 20, 20, 22, 24, 28]})
df['decile'] = pd.qcut(df['column'], 10, labels=False)
Try this:
df['points'] = df['decile'].add(1).mul(5)
Output:
column decile points
0 1 0 5
1 2 0 5
2 2 0 5
3 3 1 10
4 4 1 10
5 4 1 10
6 5 2 15
7 6 2 15
8 6 2 15
9 7 3 20
10 7 3 20
11 8 3 20
12 8 3 20
13 9 4 25
14 10 4 25
15 10 4 25
16 10 4 25
17 12 5 30
18 13 6 35
19 14 6 35
20 16 6 35
21 16 6 35
22 16 6 35
23 18 7 40
24 19 8 45
25 20 8 45
26 20 8 45
27 22 9 50
28 24 9 50
29 28 9 50
Simple enough; you can apply operations between columns directly. Deciles are numbered 0 through 9, so they are naturally ordered. You want increments of 5 points per decile, so multiplying the decile by 5 gives that; since you want to start at 5, offset with a simple addition. The following gives what I believe you want:
df['points'] = df['decile'] * 5 + 5
Here's a way that can easily be generalized to different point systems that are not linear with decile:
df['points'] = df.decile.map({d:5 * (d + 1) for d in range(10)})
This uses Series.map() to map from each decile value to the desired number of points for that decile using a dictionary.
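The dictionary makes non-linear schemes trivial. As a sketch, here is a made-up variant where the top decile jumps to 60 points (the 60 is purely illustrative, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'column': [1, 2, 2, 3, 4, 4, 5, 6, 6, 7, 7, 8, 8, 9, 10,
                              10, 10, 12, 13, 14, 16, 16, 16, 18, 19, 20, 20,
                              22, 24, 28]})
df['decile'] = pd.qcut(df['column'], 10, labels=False)

# hypothetical non-linear scheme: 5-point steps, but a jump to 60 for the top decile
points_for = {d: 5 * (d + 1) for d in range(9)}
points_for[9] = 60
df['points'] = df['decile'].map(points_for)
print(df['points'].max())  # 60
```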

Pandas: drop values outside multiple ranges using isin

Given a df
a
0 1
1 2
2 1
3 7
4 10
5 11
6 21
7 22
8 26
9 51
10 56
11 83
12 82
13 85
14 90
I would like to drop rows if the value in column a is not within any of these ranges:
(10-15), (25-30), (50-55), (80-85). These ranges are built from lbot and ltop:
lbot = [10, 25, 50, 80]
ltop = [15, 30, 55, 85]
I am thinking this can be achieved via pandas isin:
df[df['a'].isin(list(zip(lbot, ltop)))]
But it returns an empty df instead.
The expected output is
a
10
11
26
51
83
82
85
You can use numpy broadcasting to create a boolean mask which, for each row, is True if the value is within any of the ranges, and filter df with it:
a = df[['a']].to_numpy()
out = df[((a >= lbot) & (a <= ltop)).any(axis=1)]
Output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
Create the values to test with a flattened list comprehension over ranges:
df = df[df['a'].isin([z for x, y in zip(lbot, ltop) for z in range(x, y + 1)])]
print (df)
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
Or use np.concatenate to flatten the list of ranges:
df = df[df['a'].isin(np.concatenate([range(x, y+1) for x, y in zip(lbot,ltop)]))]
A method that uses between():
df[pd.concat([df['a'].between(x, y) for x,y in zip(lbot, ltop)], axis=1).any(axis=1)]
output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
If your values in the two lists are sorted, a method that doesn't require any loop is to use pandas.cut and check that you obtain the same group when cutting on the two lists:
# group based on lower bound
id1 = pd.cut(df['a'], bins=lbot + [float('inf')], labels=range(len(lbot)),
             right=False)  # include lower bound
# group based on upper bound
id2 = pd.cut(df['a'], bins=[0] + ltop, labels=range(len(ltop)))
# ensure groups are identical
df[id1.eq(id2)]
output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
intermediate groups:
a id1 id2
0 1 NaN 0
1 2 NaN 0
2 1 NaN 0
3 7 NaN 0
4 10 0 0
5 11 0 0
6 21 0 1
7 22 0 1
8 26 1 1
9 51 2 2
10 56 2 3
11 83 3 3
12 82 3 3
13 85 3 3
14 90 3 NaN
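Another loop-free option, assuming the ranges do not overlap (a requirement of `get_indexer`), is a pandas `IntervalIndex` of closed intervals:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 7, 10, 11, 21, 22, 26, 51, 56, 83, 82, 85, 90]})
lbot = [10, 25, 50, 80]
ltop = [15, 30, 55, 85]

# closed intervals [lo, hi]; get_indexer returns -1 for values in no interval
intervals = pd.IntervalIndex.from_arrays(lbot, ltop, closed='both')
out = df[intervals.get_indexer(df['a']) != -1]
print(out['a'].tolist())  # [10, 11, 26, 51, 83, 82, 85]
```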

How to randomly drop rows in Pandas dataframe until there are equal number of values in a column?

I have a dataframe df with two columns, X and y.
In df['y'] I have integers from 1 to 10 inclusive. However, they have different frequencies:
df['y'].value_counts()
10 6645
9 6213
8 5789
7 4643
6 2532
5 1839
4 1596
3 878
2 815
1 642
I want to cut down my dataframe so that there is an equal number of occurrences of each label. Since the minimum frequency is 642, I only want to keep 642 randomly sampled rows of each class label, so that my new dataframe has 642 rows per class label.
I thought this might have helped; however, stratifying only keeps the same percentage of each label, whereas I want all my labels to have the same frequency.
As an example of a dataframe:
import random
import pandas as pd

df = pd.DataFrame()
df['y'] = sum([[10]*6645, [9]*6213, [8]*5789, [7]*4643, [6]*2532,
               [5]*1839, [4]*1596, [3]*878, [2]*815, [1]*642], [])
df['X'] = [random.choice(list('abcdef')) for i in range(len(df))]
Use sample with groupby:
df = pd.DataFrame(np.random.randint(1, 11, 100), columns=['y'])
val_cnt = df['y'].value_counts()
min_sample = val_cnt.min()
print(min_sample) # e.g. outputs 7
print(df.groupby('y').apply(lambda s: s.sample(min_sample)))
Output
y
y
1 68 1
8 1
82 1
17 1
99 1
31 1
6 1
2 55 2
15 2
81 2
22 2
46 2
13 2
58 2
3 2 3
30 3
84 3
61 3
78 3
24 3
98 3
4 51 4
86 4
52 4
10 4
42 4
80 4
53 4
5 16 5
87 5
... ..
6 26 6
18 6
7 56 7
4 7
60 7
65 7
85 7
37 7
70 7
8 93 8
41 8
28 8
20 8
33 8
64 8
62 8
9 73 9
79 9
9 9
40 9
29 9
57 9
7 9
10 96 10
67 10
47 10
54 10
97 10
71 10
94 10
[70 rows x 1 columns]
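On pandas >= 1.1 the same idea is a one-liner with `GroupBy.sample` (a sketch on synthetic data mirroring the answer above; it also avoids the extra index level that `apply` adds):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'y': rng.integers(1, 11, 100),
                   'X': rng.choice(list('abcdef'), 100)})

min_sample = df['y'].value_counts().min()
balanced = df.groupby('y').sample(n=min_sample, random_state=0)
# every label now occurs exactly min_sample times
print(balanced['y'].value_counts().nunique())  # 1
```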

Summing values across given range of days difference backwards - Pandas

I have created a days-difference column in a pandas dataframe, and I'm looking to add a column that holds the sum of a value over a given window of days, looking backwards.
Note that I can supply a date column for each row if needed; the diff column was created as the difference in days from the first day of the data.
Example
df = pd.DataFrame.from_dict({'diff': [0, 0, 1, 2, 2, 2, 2, 10, 11, 15, 18],
                             'value': [10, 11, 15, 2, 5, 7, 8, 9, 23, 14, 15]})
df
Out[12]:
diff value
0 0 10
1 0 11
2 1 15
3 2 2
4 2 5
5 2 7
6 2 8
7 10 9
8 11 23
9 15 14
10 18 15
I want to add a 5_days_back_sum column that sums value over the past 5 days, including the same day, so the result would be like this:
Out[15]:
5_days_back_sum diff value
0 21 0 10
1 21 0 11
2 36 1 15
3 58 2 2
4 58 2 5
5 58 2 7
6 58 2 8
7 9 10 9
8 32 11 23
9 46 15 14
10 29 18 15
How can I achieve that? I originally had a date column used to create the diff column; if that helps, it is available.
Use a custom function with boolean indexing to filter the range, then sum:
def f(x):
    return df.loc[(df['diff'] >= x - 5) & (df['diff'] <= x), 'value'].sum()

df['5_days_back_sum'] = df['diff'].apply(f)
print(df)
diff value 5_days_back_sum
0 0 10 21
1 0 11 21
2 1 15 36
3 2 2 58
4 2 5 58
5 2 7 58
6 2 8 58
7 10 9 9
8 11 23 32
9 15 14 46
10 18 15 29
Similar solution with between:
def f(x):
    return df.loc[df['diff'].between(x - 5, x), 'value'].sum()

df['5_days_back_sum'] = df['diff'].apply(f)
print(df)
diff value 5_days_back_sum
0 0 10 21
1 0 11 21
2 1 15 36
3 2 2 58
4 2 5 58
5 2 7 58
6 2 8 58
7 10 9 9
8 11 23 32
9 15 14 46
10 18 15 29
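Since diff is an integer day offset, a vectorized alternative (a sketch) is to total value per day, take a fixed 6-row rolling sum over the complete day range (day d plus the 5 days before it), and map the result back onto the rows:

```python
import pandas as pd

df = pd.DataFrame({'diff': [0, 0, 1, 2, 2, 2, 2, 10, 11, 15, 18],
                   'value': [10, 11, 15, 2, 5, 7, 8, 9, 23, 14, 15]})

# total value per day, over the complete day range (missing days count as 0)
daily = (df.groupby('diff')['value'].sum()
           .reindex(range(df['diff'].max() + 1), fill_value=0))

# fixed 6-row window = current day plus the 5 days before it
window = daily.rolling(6, min_periods=1).sum()

df['5_days_back_sum'] = df['diff'].map(window).astype(int)
print(df['5_days_back_sum'].tolist())
# [21, 21, 36, 58, 58, 58, 58, 9, 32, 46, 29]
```

This avoids the O(n²) scan of apply and scales with the number of distinct days.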
