Multiply columns with range between 0 and 1 by 100 in Pandas - python

Given a Pandas dataframe:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'C': [11, 12, 13, 14, 15]})
A B C
0 1 0.1 11
1 2 0.2 12
2 3 0.3 13
3 4 0.4 14
4 5 0.5 15
For all of the columns where the range of values is between 0 and 1, I'd like to multiply all values in those columns by a constant (say, 100). I don't know a priori which columns have values between 0 and 1 and there are 100+ columns.
A B C
0 1 10 11
1 2 20 12
2 3 30 13
3 4 40 14
4 5 50 15
I've tried using .min() and .max() and compared them to the desired range to return True/False values for each column.
(df.min() >= 0) & (df.max() <= 1)
A False
B True
C False
but it isn't obvious how to then select the True columns and multiply those values by 100.
Update
I came up with this solution instead, applying the boolean mask to its own index so that only the True columns are selected:
mask = (df.min() >= 0) & (df.max() <= 1)
col_names = mask.index[mask]
df[col_names] = df[col_names] * 100
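A more compact sketch of the same idea: the boolean Series can index the columns directly through .loc, so the column names never need to be pulled out explicitly.
mask = (df.min() >= 0) & (df.max() <= 1)
df.loc[:, mask] *= 100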

Something like this?
to_multiply = [col for col in df if 1 >= min(df[col]) >= 0 and 1 >= max(df[col]) >= 0]
df[to_multiply] = df[to_multiply] * 100

We can construct a boolean mask that tests whether the values in the df are greater than (gt) 0 and less than (lt) 1, call np.all with axis=0 to reduce that mask to a single True/False per column, use the result to filter the columns, and then multiply all values in those columns by 100:
In [58]:
df[df.columns[np.all(df.gt(0) & df.lt(1),axis=0)]] *= 100
df
Out[58]:
A B C
0 1 10 11
1 2 20 12
2 3 30 13
3 4 40 14
4 5 50 15
Breaking the above down:
In [61]:
df.gt(0) & df.lt(1)
Out[61]:
A B C
0 False True False
1 False True False
2 False True False
3 False True False
4 False True False
In [62]:
np.all(df.gt(0) & df.lt(1),axis=0)
Out[62]:
array([False, True, False], dtype=bool)
In [63]:
df.columns[np.all(df.gt(0) & df.lt(1),axis=0)]
Out[63]:
Index(['B'], dtype='object')

You can update your DataFrame based on your selection criteria:
df.update(df.loc[:, (df.ge(0).all() & df.le(1).all())].mul(100))
>>> df
A B C
0 1 10 11
1 2 20 12
2 3 30 13
3 4 40 14
4 5 50 15
Any column in which every value is greater than or equal to zero and less than or equal to one is multiplied by 100.
Other comparison operators:
.ge (greater than or equal to)
.gt (greater than)
.le (less than or equal to)
.lt (less than)
.eq (equals)
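To make the role of these operators concrete, here is a small sketch of the selection step on its own (using the example df from the question); .ge and .le compare element-wise, and .all() then collapses each column to a single flag:
df.ge(0)                          # boolean DataFrame: True where a value is >= 0
df.le(1)                          # boolean DataFrame: True where a value is <= 1
df.ge(0).all() & df.le(1).all()   # one flag per column: A False, B True, C False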

Use .all() to check whether all values in a column are within the range and, if so, multiply that column -
In [1877]: %paste
for col in df.columns:
    if (0 < df[col]).all() and (df[col] < 1).all():
        df[col] = df[col] * 100
## -- End pasted text --
In [1878]: df
Out[1878]:
A B C
0 1 10 11
1 2 20 12
2 3 30 13
3 4 40 14
4 5 50 15

Related

Pad selection range in Pandas Dataframe?

If I slice a dataframe with something like
>>> df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
>>> df.loc[df['A'] == 1]
# or
>>> df[df['A'] == 1]
A
0 1
4 1
7 1
8 1
how could I pad my selections by a buffer of 1 and get each of the indices 0, 1, 3, 4, 5, 6, 7, 8, 9? I want to select all rows for which the value in column 'A' is 1, as well as the row immediately before and after any such row.
edit I'm hoping to figure out a solution that works for arbitrary pad sizes, rather than just for a pad size of 1.
edit 2 here's another example illustrating what I'm going for
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
and we're looking for pad == 2. In this case I'd be trying to fetch rows 0, 1, 2, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16.
You can use shift with bitwise or (|):
c = df['A'] == 1
df[c|c.shift()|c.shift(-1)]
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
For arbitrary pad sizes, you may try where, interpolate, and notna to create the mask
n = 2
c = df.where(df['A'] == 1)
m = c.interpolate(limit=n, limit_direction='both').notna()
df[m]
Out[61]:
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
Here is an approach that allows for multiple pad levels. Use ffill and bfill on the boolean mask (df['A'] == 1), after converting the False values to np.nan:
import numpy as np
pad = 2
df[(df['A'] == 1).replace(False, np.nan).ffill(limit=pad).bfill(limit=pad).replace(np.nan,False).astype(bool)]
Here it is in action:
def padsearch(df, column, value, pad):
    return df[(df[column] == value).replace(False, np.nan).ffill(limit=pad).bfill(limit=pad).replace(np.nan, False).astype(bool)]
# your first example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=1))
# your other example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=2))
Result:
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
Granted the command is far less nice, and it's a little clunky to convert False to and from null. But it still uses only Pandas builtins, so it remains fairly quick.
I found another solution, but it is not nearly as slick as some of the ones already posted.
# setup
df = ...
pad = 2
# determine the set of row indices within `pad` of a match, clipped to the valid range
indices = sorted({
    x + y
    for x in df[df['A'] == 1].index
    for y in range(-pad, pad + 1)
    if 0 <= x + y < len(df)
})
# fetch rows
df.iloc[indices]

How to create a cumulative sum column in python if column value is greater than other value

I am now working on getting a cumulative sum column using pandas. However, this column must include the cumulative sum only where the value in one column is greater than the value in another column. Here's an example of my current data:
Index A B C
0 1 20 3
1 10 15 11
2 20 12 25
3 30 18 32
4 40 32 17
5 50 12 4
Then I want to cumsum() column A where column B is greater than column C; where it is not, the value should be zero. The result column D in the original df should look like:
Index A B C D
0 1 20 3 1
1 10 15 11 11
2 20 12 25 0
3 30 18 32 0
4 40 32 17 40
5 50 12 4 90
I appreciate any support in advance.
df = pd.DataFrame({'A': {0: 1, 1: 10, 2: 20, 3: 30, 4: 40, 5: 50},
                   'B': {0: 20, 1: 15, 2: 12, 3: 18, 4: 32, 5: 12},
                   'C': {0: 3, 1: 11, 2: 25, 3: 32, 4: 17, 5: 4}})
Make a boolean Series for your condition and identify consecutive groups of True or False
b_gt_c = df.B > df.C
groups = b_gt_c.ne(b_gt_c.shift()).cumsum()
In [107]: b_gt_c
Out[107]:
0 True
1 True
2 False
3 False
4 True
5 True
dtype: bool
In [108]: groups
Out[108]:
0 1
1 1
2 2
3 2
4 3
5 3
dtype: int32
Group by those groups; multiply the cumsum of each group by the condition; assign the result to the new df column.
gb = df.groupby(groups)
for k, g in gb:
    df.loc[g.index, 'D'] = g['A'].cumsum() * b_gt_c[g.index]
In [109]: df
Out[109]:
A B C D
0 1 20 3 1.0
1 10 15 11 11.0
2 20 12 25 0.0
3 30 18 32 0.0
4 40 32 17 40.0
5 50 12 4 90.0
You could skip the for loop as well:
df['G'] = np.where(df.B.gt(df.C), df.A, np.NaN)
group = df.B.gt(df.C).ne(df.B.gt(df.C).shift()).cumsum()
df['G'] = df.groupby(group).G.cumsum().fillna(0)
The trick for identifying consecutive occurrences of values comes from the SO Q&A: Grouping dataframe based on consecutive occurrence of values.
There is probably a more elegant solution, but this also works.
We first create two dummy columns, x and x_shift.
df.x retains the values of df.A where df.B > df.C and is 0 otherwise.
df.x_shift shifts those values down one row and fills the resulting NaN with 0.
In the last step we conditionally add df.A and df.x_shift, then drop df.x and df.x_shift.
df['x'] = pd.DataFrame(np.where(df.B > df.C, df.A, 0))
df['x_shift'] = df.x.shift(1).fillna(0)
df['D'] = pd.DataFrame(np.where(df.B > df.C, df.A + df.x_shift, 0))
df = df.drop(['x', 'x_shift'], axis=1)
While it's a little barbaric, you could convert to numpy arrays and then write a simple loop that goes through the three arrays and compares values.
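A minimal sketch of that idea (the variable names here are illustrative, not from the original answer): walk the three arrays once, accumulating A while B > C and resetting the running total otherwise.
import numpy as np
a, b, c = df['A'].to_numpy(), df['B'].to_numpy(), df['C'].to_numpy()
d = np.zeros(len(df))
running = 0
for i in range(len(df)):
    # accumulate A while B > C; reset the running total otherwise
    running = running + a[i] if b[i] > c[i] else 0
    d[i] = running
df['D'] = d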

Pandas: Calculate mean of a group of n values of each column of a dataframe

I have a dataframe of the following type:
A B
0 1 2
1 4 5
2 7 8
3 10 11
4 13 14
5 16 17
I want to calculate the mean of the first 3 elements of each column, then the next 3 elements, and so on, and store the results in a dataframe.
Desired Output:
A B
0 4 5
1 13 14
Using Group By was one of the approaches I thought of, but I am unable to figure out how to use it in this case.
If the index is the default RangeIndex, use integer division of the index and pass the result to groupby:
df = df.groupby(df.index // 3).mean()
print (df)
A B
0 4 5
1 13 14
Detail:
print (df.index // 3)
Int64Index([0, 0, 0, 1, 1, 1], dtype='int64')
General solution with an array created from the length of the DataFrame, which works with any index values:
df = df.groupby(np.arange(len(df)) // 3).mean()
Detail:
print (np.arange(len(df)) // 3)
[0 0 0 1 1 1]

Vectorized calculation of a column's value based on a previous value of the same column?

I have a pandas dataframe with two columns A,B as below.
I want a vectorized solution for creating a new column C where C[i] = C[i-1] - A[i] + B[i].
df = pd.DataFrame(data={'A': [10, 2, 3, 4, 5, 6], 'B': [0, 1, 2, 3, 4, 5]})
>>> df
A B
0 10 0
1 2 1
2 3 2
3 4 3
4 5 4
5 6 5
Here is the solution using for-loops:
df['C'] = df['A']
for i in range(1, len(df)):
    df['C'][i] = df['C'][i-1] - df['A'][i] + df['B'][i]
>>> df
A B C
0 10 0 10
1 2 1 9
2 3 2 8
3 4 3 7
4 5 4 6
5 6 5 5
... which does the job.
But since loops are slow in comparison to vectorized calculations, I want a vectorized solution for this in pandas:
I tried to use the shift() method like this:
df['C'] = df['C'].shift(1).fillna(df['A']) - df['A'] + df['B']
but it didn't help since the shifted C column isn't updated with the calculation. It keeps its original values:
>>> df['C'].shift(1).fillna(df['A'])
0 10
1 10
2 2
3 3
4 4
5 5
and that produces a wrong result.
This can be vectorized because delta[i] = C[i] - C[i-1] = -A[i] + B[i]. You can compute delta from A and B first, then take the cumulative sum of delta (plus C[0], which equals A[0]) to get the full C.
Code as follows:
delta = df['B'] - df['A']
delta[0] = 0
df['C'] = df.loc[0, 'A'] + delta.cumsum()
print(df)
A B C
0 10 0 10
1 2 1 9
2 3 2 8
3 4 3 7
4 5 4 6
5 6 5 5

how to filter groupby object in pandas based on difference of values within the group?

I have a dataframe as listed below:
In []: dff = pd.DataFrame({'A': np.arange(8),
                           'B': list('aabbbbcc'),
                           'C': np.random.randint(100, size=8)})
which I have grouped based on column B:
In []: grouped = dff.groupby('B')
Now, I want to filter dff based on the difference of values in column 'C'. For example, if the difference between any two points within a group in column C is greater than a threshold, remove that row.
If dff is:
A B C
0 0 a 18
1 1 a 25
2 2 b 56
3 3 b 62
4 4 b 46
5 5 b 56
6 6 c 74
7 7 c 3
Then, a threshold of 10 for C will produce a final table like:
A B C
0 0 a 18
1 1 a 25
2 2 b 56
3 3 b 62
4 4 b 46
5 5 b 56
Here the category c group (lowercase) is removed, as the difference between its two values is greater than 10, while category b keeps all of its rows since they are all within 10 of each other.
I think I'd do the hard work in numpy:
In [11]: a = np.array([2, 3, 14, 15, 54])
In [12]: res = np.abs(a[:, np.newaxis] - a) < 10 # Note: perhaps you want <= 10.
In [13]: np.fill_diagonal(res, False)
In [14]: res.any(0)
Out[14]: array([ True, True, True, True, False], dtype=bool)
You could wrap this in a function:
In [15]: def has_close(a, n=10):
             res = np.abs(a[:, np.newaxis] - a) < n
             np.fill_diagonal(res, False)
             return res.any(0)
In [16]: g = df.groupby('B', as_index=False)
In [17]: g.apply(lambda x: x[has_close(x.C.values)])
Out[17]:
A B C
0 0 a 18
1 1 a 25
2 2 b 56
3 3 b 62
5 5 b 56
