Change column values in pandas based on condition

Change column values in pandas based on condition - python

df:
A
0 219
1 590
2 272
3 945
4 175
5 930
6 662
7 472
8 251
9 130
I am trying to create a new column quantile based on which quantile the value falls in, for example:
if value > 1st quantile : value = 1
if value > 2nd quantile : value = 2
if value > 3rd quantile : value = 3
if value > 4th quantile : value = 4
Code:
f_q = df['A'] .quantile (0.25)
s_q = df['A'] .quantile (0.5)
t_q = df['A'] .quantile (0.75)
fo_q = df['A'] .quantile (1)
index = 0
for i in range(len(test_df)):
value = df.at[index,"A"]
if value > 0 and value <= f_q:
df.at[index,"A"] = 1
elif value > f_q and value <= s_q:
df.at[index,"A"] = 2
elif value > s_q and value <= t_q:
df.at[index,"A"] = 3
elif value > t_q and value <= fo_q:
df.at[index,"A"] = 4
index += 1
The code works fine. But I would like to know if there is a more efficient pandas way of doing this. Any suggestions are helpful.

Yes, using pd.qcut:
>>> pd.qcut(df.A, 4).cat.codes + 1
0 1
1 3
2 2
3 4
4 1
5 4
6 4
7 3
8 2
9 1
dtype: int8
(Gives me exactly the same result your code does.)
You could also call np.unique on the qcut result:
>>> np.unique(pd.qcut(df.A, 4), return_inverse=True)[1] + 1
array([1, 3, 2, 4, 1, 4, 4, 3, 2, 1])
Or, using pd.factorize (note the slight difference in the output):
>>> pd.factorize(pd.qcut(df.A, 4))[0] + 1
array([1, 2, 3, 4, 1, 4, 4, 2, 3, 1])

Related

Conflate sets of pandas columns into a single column

I have a dataframe that looks like this:
n objects id x y Vx Vy id.1 x.1 ... Vx.40 Vy.40 ...
0 41 1 2 3 4 5 17 3 ... 5 6 ...
1 21 1 2 3 4 5 17 3 ... 0 0 ...
2 36 1 2 3 4 5 17 3 ... 0 0 ...
My goal is to conflate the contents of every set of id, x, y, Vx, and Vy columns into a single column.
I.e. the end result should look like this:
n objects object_0 object_1 object_40 ...
0 41 [1,2,3,4,5] [17,3,...] ... [...5,6] ...
1 21 [1,2,3,4,5] [17,3,...] ... [...0,0] ...
2 36 [1,2,3,4,5] [17,3,...] ... [...0,0] ...
I am kind of at a loss as to how to achieve that. My only idea was hardcoding it like
df['object_0'] = df[['id', 'x', 'y', 'Vx', 'Vy']].values.tolist()
df.drop(['id', 'x', 'y', 'Vx', 'Vy'], inplace=True)
for i in range(1,41):
df[f'object_{i}'] = df[[f'id.{i}', f'x.{i}', f'y.{i}', f'Vx.{i}', f'Vy.{i}']].values.tolist()
df.drop([f'id.{i}', f'x.{i}', f'y.{i}', f'Vx.{i}', f'Vy.{i}'], inplace=True)
but that is not a good option, as the number (and names) of repeating columns varies between dataframes. What is consistent is that the number of objects per row is listed, and every object has the same number of elements (i.e. there are no cases of columns going like id.26, y.26, Vx.26, id.27 Vy.27, id.28...)
I suppose I could find the number of objects via something like
last_obj = max([ int(col.split('.')[-1]) for col in df.columns ])
and then dig out the number and names of cols per object by
[ col.split('.')[0] for col in df.columns if col.split('.')[-1] == last_obj ]
but at that point this all starts seeming a bit too cluttered and hacky.
Is there a cleaner way to do that, one that works irrespective of the number of objects, of columns per object, and (ideally) of column names? Any help would be appreciated!
EDIT:
This does work, but is there a more elegant way of doing it?
last_obj = max([ int(col.split('.')[-1]) for col in df.columns if '.' in col])
obj_col_names = [ col.split('.')[0] for col in df.columns if col.split('.')[-1] == str(last_obj) ]
df['object_0'] = df[obj_col_names].values.tolist()
df.drop(obj_col_names, axis=1, inplace=True)
for i in range(1, last_obj+1):
current_col_set = [ "".join([col, f'.{i}']) for col in obj_col_names ]
df[f'object_{i}'] = df[current_col_set].values.tolist()
df.drop(current_col_set, axis=1, inplace=True)

This solution renames the columns into same-named groups. Then does a groupby on those columns and converts them into lists.
Starting with
n objects id x y Vx Vy id.1 x.1 y.1 Vx.1 Vy.1
0 0 41 1 2 3 4 5 17 3 3 4 5
1 1 21 1 2 3 4 5 17 3 3 4 5
2 2 36 1 2 3 4 5 17 3 3 4 5
Then
nb_cols = df.shape[1]-2
nb_groups = int(df.columns[-1].split('.')[1])+1
cols_per_group = nb_cols // nb_groups
group_cols = np.arange(nb_cols)//cols_per_group
explode_cols = list(np.arange(nb_groups))
pd.concat([df.loc[:,:'objects'].reset_index(drop=True), \
df.loc[:,'id':].set_axis(group_cols, axis=1).groupby(level=0, axis=1) \
.apply(lambda x: x.values).to_frame().T.explode(explode_cols).reset_index(drop=True) \
.rename(columns = lambda x: 'object_' + str(x)) \
], axis=1)
Result
n objects object_0 object_1
0 0 41 [1, 2, 3, 4, 5] [17, 3, 3, 4, 5]
1 1 21 [1, 2, 3, 4, 5] [17, 3, 3, 4, 5]
2 2 36 [1, 2, 3, 4, 5] [17, 3, 3, 4, 5]

Looking for an iterative loop in python which can add up all column values if certain condition meets

I have a dataframe with columns m, n:
m=[0, 0, 1, 0, 0, 0, 4, 0, 0]
n=[6, 1, 2, 1, 4, 3, 1, 3, 5, 1]
I am looking for an iterative loop that adds up values of column n if the value in column m is non-zero. For example at 3rd place of column m the value is 1 (non-zero) so it should add in the column n from index 0 to 2 i.e. 6+1+2=9. Similarly, at m[6]=4 (non-zero) this implies 1+4+3+1=9 and so on.

Let's say you have a dataframe and you want to sum the elements in each column based on the position of non-zero values in column "m". The following code gives you the output as a dataframe. See the comment in the code if you are just looking for summing the values in column "n":
import pandas as pd
from random import randint
m = [0, 1, 0, 0, 1, 0, 0, 0, 2]
n = [1, 1, 3, 4, 1, 1, 2, 1, 3]
r = [randint(1, 3) for _ in m]
names = ['lev', 'yan', 'coke' , 'coke', 'yan', 'lev', 'lev', 'yan', 'lev']
df = pd.DataFrame({'m': m, 'n': n, 'r': r, 'names': names})
print(f"Input dataframe:\n{df}")
# if you want to iterate over all columns
iter_cols = df.columns.tolist()
iter_cols.remove('m')
# To iterate over an specific column (e.g. 'n') you use iter_cols = ['n']
starting_idx = 0
sum_df = pd.DataFrame()
for idx, val in enumerate(df.m):
if val != 0:
sum_df = sum_df.append(df.iloc[starting_idx: (idx+1)][iter_cols].sum(), ignore_index=True)
starting_idx = idx+1
print(f"Output dataframe:\n{sum_df}")
Output:
Input dataframe:
m n r names
0 0 1 2 lev
1 1 1 3 yan
2 0 3 1 coke
3 0 4 2 coke
4 1 1 2 yan
5 0 1 3 lev
6 0 2 3 lev
7 0 1 3 yan
8 2 3 2 lev
Output dataframe:
n names r
0 2.0 levyan 5.0
1 8.0 cokecokeyan 5.0
2 7.0 levlevyanlev 11.0
And if you want to iterate over distinct values in names column and sum the values in 'n' column accordingly:
iter_cols = ['n']
distinct_names = set(df.names)
print(distinct_names)
out_dct = {}
for name in distinct_names:
starting_idx = 0
sum_df = pd.DataFrame()
for idx, val in enumerate(df.names):
if val == name:
sum_df = sum_df.append(df.iloc[starting_idx: (idx+1)][iter_cols].sum(), ignore_index=True)
starting_idx = idx+1
out_dct[name] = sum_df

Cumsum column while skipping rows or setting fixed values on a condition based on the result of the actual cumsum

I'm trying to find a vectorized solution in pandas that is quite common in spreadsheets which is to cumsum while skipping or setting fixed values on a condition based on the result of the actual cumsum. I have the following:
A
1 0
2 -1
3 2
4 3
5 -2
6 -3
7 1
8 -1
9 1
10 -2
11 1
12 2
13 -1
14 -2
What I need is to add a second column with the cumsum of 'A' and if one of these sums gives a positive value replace it with 0 and continue the cumsum using that 0. At the same time if the cumsum gives a negative value that is lower than the lowest value in column A recorded after a 0 in column B I will need to replace it with that lowest value in column A. I know this is quite a problem but is there a vectorized solution for this? Maybe using an auxiliary column. The result should look like this:
A B
1 0 0
2 -1 -1 # -1+0 = -1
3 2 0 # -1 + 2 = 1 but 1>0 so this is 0
4 3 0 # same as previous row
5 -2 -2 # -2+0 = -2
6 -3 -3 # -2-3 = -5 but the lowest value in column A since last 0 is -3 so this is replaced by -3
7 1 -2 # 1-3 = -2
8 -1 -3 # -1-2 = -3
9 1 -2 # -3 + 1 = -2
10 -2 -3 # -2-2 = -4 but the lowest value in column A since last 0 is -3 so this is replaced by -3
11 1 -2 # -3 +1 = -2
12 2 0 # -2+2 = 0
13 -1 -1 # 0-1 = -1
14 -2 -2 # -1-2 = -3 but the lowest value in column A since last cap is -2 so this is -2 instead of -3
For the moment I made this but does not work 100% and again is not really efficient:
df['B'] = 0
df['B'][0] = 0
for x in range(len(df)-1):
A = df['A'][x + 1]
B = df['B'][x] + A
if B >= 0:
df['B'][x+1] = 0
elif B < 0 and A < 0 and B < A:
df['B'][x+1] = A
else:
df['B'][x + 1] = B

Using df['A'].expanding(1).apply(function) I could run own function which first get only one row, next 2 rows, next 3 rows, etc. I doesn't give result from previous calculation and it needs to make all calculations again and again but it doesn't need global
variables and hardcoded df['A']
Doc: Series.expanding
A = [0, -1, 2, 3, -2, -3, 1, -1, 1, -2, 1, 2, -1, -2]
import pandas as pd
df = pd.DataFrame({"A": A})
def function(values):
#print(values)
#print(type(valuse)
#print(len(values))
result = 0
last_zero = 0
for index, value in enumerate(values):
result += value
if result >= 0:
result = 0
last_zero = index
else:
minimal = min(values[last_zero:])
#print(index, last_zero, minimal)
#if result < minimal:
# result = minimal
result = max(result, minimal)
#print('result:', result)
return result
df['B'] = df['A'].expanding(1).apply(function)
df['B'] = df['B'].astype(int)
print(df)
Result:
A B
0 0 0
1 -1 -1
2 2 0
3 3 0
4 -2 -2
5 -3 -3
6 1 -2
7 -1 -3
8 1 -2
9 -2 -3
10 1 -2
11 2 0
12 -1 -1
13 -2 -2
The same but with normal apply() - it needs global variables and hardcoded df['A']
A = [0, -1, 2, 3, -2, -3, 1, -1, 1, -2, 1, 2, -1, -2]
import pandas as pd
df = pd.DataFrame({"A": A})
result = 0
last_zero = 0
index = 0
def function(value):
global result
global last_zero
global index
result += value
if result >= 0:
result = 0
last_zero = index
else:
minimal = min(df['A'][last_zero:])
#print(index, last_zero, minimal)
#if result < minimal:
# result = minimal
result = max(result, minimal)
index += 1
#print('result:', result)
return result
df['B'] = df['A'].apply(function)
df['B'] = df['B'].astype(int)
print(df)
The same using normal for-loop
A = [0, -1, 2, 3, -2, -3, 1, -1, 1, -2, 1, 2, -1, -2]
import pandas as pd
df = pd.DataFrame({"A": A})
all_values = []
result = 0
last_zero = 0
for index, value in df['A'].iteritems():
result += value
if result >= 0:
result = 0
last_zero = index
else:
minimal = min(df['A'][last_zero:])
#print(index, last_zero, minimal)
#if result < minimal:
# result = minimal
result = max(result, minimal)
all_values.append(result)
df['B'] = all_values
print(df)

Compare current column value to different column value by row slices

Assuming a dataframe like this
In [5]: data = pd.DataFrame([[9,4],[5,4],[1,3],[26,7]])
In [6]: data
Out[6]:
0 1
0 9 4
1 5 4
2 1 3
3 26 7
I want to count how many times the values in a rolling window/slice of 2 on column 0 are greater or equal to the value in col 1 (4).
On the first number 4 at col 1, a slice of 2 on column 0 yields 5 and 1, so the output would be 2 since both numbers are greater than 4, then on the second 4 the next slice values on col 0 would be 1 and 26, so the output would be 1 because only 26 is greater than 4 but not 1. I can't use rolling window since iterating through rolling window values is not implemented.
I need something like a slice of the previous n rows and then I can iterate, compare and count how many times any of the values in that slice are above the current row.

I have done this using list instead of doing it in data frame. Check the code below:
list1, list2 = df['0'].values.tolist(), df['1'].values.tolist()
outList = []
for ix in range(len(list1)):
if ix < len(list1) - 2:
if list2[ix] < list1[ix + 1] and list2[ix] < list1[ix + 2]:
outList.append(2)
elif list2[ix] < list1[ix + 1] or list2[ix] < list1[ix + 2]:
outList.append(1)
else:
outList.append(0)
else:
outList.append(0)
df['2_rows_forward_moving_tag'] = pd.Series(outList)
Output:
0 1 2_rows_forward_moving_tag
0 9 4 1
1 5 4 1
2 1 3 0
3 26 7 0

Numpy: recode numeric array to which quintile each element belongs

I have a numeric vector a:
import numpy as np
a = np.random.rand(100)
I wish to get the vector (or any other vector) recoded so that each element is either 0, 1, 2, 3 or 4, according to which a quintile it is in (could be more general for any quantile, like quartile, decile etc.).
This is what I'm doing. There has to be something more elegant, no?
from scipy.stats import percentileofscore
n_quantiles = 5
def get_quantile(i, a, n_quantiles):
if a[i] >= max(a):
return n_quantiles - 1
return int(percentileofscore(a, a[i])/(100/n_quantiles))
a_recoded = np.array([get_quantile(i, a, n_quantiles) for i in range(len(a))])
print(a)
print(a_recoded)
[0.04708996 0.86267278 0.23873192 0.02967989 0.42828385 0.58003015
0.8996666 0.15359369 0.83094778 0.44272398 0.60211289 0.90286434
0.40681163 0.91338397 0.3273745 0.00347029 0.37471307 0.72735901
0.93974808 0.55937197 0.39297097 0.91470761 0.76796271 0.50404401
0.1817242 0.78244809 0.9548256 0.78097562 0.90934337 0.89914752
0.82899983 0.44116683 0.50885813 0.2691431 0.11676798 0.84971927
0.38505195 0.7411976 0.51377242 0.50243197 0.89677377 0.69741088
0.47880953 0.71116534 0.01717348 0.77641096 0.88127268 0.17925502
0.53053573 0.16935597 0.65521692 0.19042794 0.21981197 0.01377195
0.61553814 0.8544525 0.53521604 0.88391848 0.36010949 0.35964882
0.29721931 0.71257335 0.26350287 0.22821314 0.8951419 0.38416004
0.19277649 0.67774468 0.27084229 0.46862229 0.3107887 0.28511048
0.32682302 0.14682896 0.10794566 0.58668243 0.16394183 0.88296862
0.55442047 0.25508233 0.86670299 0.90549872 0.04897676 0.33042884
0.4348465 0.62636481 0.48201213 0.49895892 0.36444648 0.01410316
0.46770595 0.09498391 0.96793139 0.03931124 0.64286295 0.50934846
0.59088907 0.56368594 0.7820928 0.77172038]
[0 4 1 0 2 3 4 0 4 2 3 4 2 4 1 0 1 3 4 2 1 4 3 2 0 3 4 3 4 4 4 2 2 1 0 4 1
3 2 2 4 3 2 3 0 3 4 0 2 0 3 0 1 0 3 4 2 4 1 1 1 3 1 1 4 1 0 3 1 2 1 1 1 0
0 3 0 4 2 1 4 4 0 1 2 3 2 2 1 0 2 0 4 0 3 2 3 2 3 3]
Update: just wanted to say this is so easy in R:
How to get the x which belongs to a quintile?

You could use argpartition. Example:
>>> a = np.random.random(20)
>>> N = len(a)
>>> nq = 5
>>> o = a.argpartition(np.arange(1, nq) * N // nq)
>>> out = np.empty(N, int)
>>> out[o] = np.arange(N) * nq // N
>>> a
array([0.61238649, 0.37168998, 0.4624829 , 0.28554766, 0.00098016,
0.41979328, 0.62275886, 0.4254548 , 0.20380679, 0.762435 ,
0.54054873, 0.68419986, 0.3424479 , 0.54971072, 0.06929464,
0.51059431, 0.68448674, 0.97009023, 0.16780152, 0.17887862])
>>> out
array([3, 1, 2, 1, 0, 2, 3, 2, 1, 4, 3, 4, 1, 3, 0, 2, 4, 4, 0, 0])

Here's one way to do it using pd.cut()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(100))
df.columns = ['values']
# Apply the quantiles
gdf = df.groupby(pd.cut(df.loc[:, 'values'], np.arange(0, 1.2, 0.2)))['values'].apply(lambda x: list(x)).to_frame()
# Make use of the automatic indexing to assign quantile numbers
gdf.reset_index(drop=True, inplace=True)
# Re-expand the grouped list of values. Method provided by #Zero at https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows
gdf['values'].apply(pd.Series).stack().reset_index(level=1, drop=True).to_frame('values').reset_index()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Change column values in pandas based on condition - python

Related

Conflate sets of pandas columns into a single column

Looking for an iterative loop in python which can add up all column values if certain condition meets

Cumsum column while skipping rows or setting fixed values on a condition based on the result of the actual cumsum

Compare current column value to different column value by row slices

Numpy: recode numeric array to which quintile each element belongs

Categories

Resources