df:
A
0 219
1 590
2 272
3 945
4 175
5 930
6 662
7 472
8 251
9 130
I am trying to create a new column quantile based on which quantile the value falls in, for example:
if value > 1st quantile : value = 1
if value > 2nd quantile : value = 2
if value > 3rd quantile : value = 3
if value > 4th quantile : value = 4
Code:
f_q = df['A'] .quantile (0.25)
s_q = df['A'] .quantile (0.5)
t_q = df['A'] .quantile (0.75)
fo_q = df['A'] .quantile (1)
index = 0
for i in range(len(test_df)):
value = df.at[index,"A"]
if value > 0 and value <= f_q:
df.at[index,"A"] = 1
elif value > f_q and value <= s_q:
df.at[index,"A"] = 2
elif value > s_q and value <= t_q:
df.at[index,"A"] = 3
elif value > t_q and value <= fo_q:
df.at[index,"A"] = 4
index += 1
The code works fine. But I would like to know if there is a more efficient pandas way of doing this. Any suggestions are helpful.
Yes, using pd.qcut:
>>> pd.qcut(df.A, 4).cat.codes + 1
0 1
1 3
2 2
3 4
4 1
5 4
6 4
7 3
8 2
9 1
dtype: int8
(Gives me exactly the same result your code does.)
You could also call np.unique on the qcut result:
>>> np.unique(pd.qcut(df.A, 4), return_inverse=True)[1] + 1
array([1, 3, 2, 4, 1, 4, 4, 3, 2, 1])
Or, using pd.factorize (note the slight difference in the output):
>>> pd.factorize(pd.qcut(df.A, 4))[0] + 1
array([1, 2, 3, 4, 1, 4, 4, 2, 3, 1])
Related
I have a dataframe that looks like this:
n objects id x y Vx Vy id.1 x.1 ... Vx.40 Vy.40 ...
0 41 1 2 3 4 5 17 3 ... 5 6 ...
1 21 1 2 3 4 5 17 3 ... 0 0 ...
2 36 1 2 3 4 5 17 3 ... 0 0 ...
My goal is to conflate the contents of every set of id, x, y, Vx, and Vy columns into a single column.
I.e. the end result should look like this:
n objects object_0 object_1 object_40 ...
0 41 [1,2,3,4,5] [17,3,...] ... [...5,6] ...
1 21 [1,2,3,4,5] [17,3,...] ... [...0,0] ...
2 36 [1,2,3,4,5] [17,3,...] ... [...0,0] ...
I am kind of at a loss as to how to achieve that. My only idea was hardcoding it like
df['object_0'] = df[['id', 'x', 'y', 'Vx', 'Vy']].values.tolist()
df.drop(['id', 'x', 'y', 'Vx', 'Vy'], inplace=True)
for i in range(1,41):
df[f'object_{i}'] = df[[f'id.{i}', f'x.{i}', f'y.{i}', f'Vx.{i}', f'Vy.{i}']].values.tolist()
df.drop([f'id.{i}', f'x.{i}', f'y.{i}', f'Vx.{i}', f'Vy.{i}'], inplace=True)
but that is not a good option, as the number (and names) of repeating columns varies between dataframes. What is consistent is that the number of objects per row is listed, and every object has the same number of elements (i.e. there are no cases of columns going like id.26, y.26, Vx.26, id.27 Vy.27, id.28...)
I suppose I could find the number of objects via something like
last_obj = max([ int(col.split('.')[-1]) for col in df.columns ])
and then dig out the number and names of cols per object by
[ col.split('.')[0] for col in df.columns if col.split('.')[-1] == last_obj ]
but at that point this all starts seeming a bit too cluttered and hacky.
Is there a cleaner way to do that, one that works irrespective of the number of objects, of columns per object, and (ideally) of column names? Any help would be appreciated!
EDIT:
This does work, but is there a more elegant way of doing it?
last_obj = max([ int(col.split('.')[-1]) for col in df.columns if '.' in col])
obj_col_names = [ col.split('.')[0] for col in df.columns if col.split('.')[-1] == str(last_obj) ]
df['object_0'] = df[obj_col_names].values.tolist()
df.drop(obj_col_names, axis=1, inplace=True)
for i in range(1, last_obj+1):
current_col_set = [ "".join([col, f'.{i}']) for col in obj_col_names ]
df[f'object_{i}'] = df[current_col_set].values.tolist()
df.drop(current_col_set, axis=1, inplace=True)
This solution renames the columns into same-named groups. Then does a groupby on those columns and converts them into lists.
Starting with
n objects id x y Vx Vy id.1 x.1 y.1 Vx.1 Vy.1
0 0 41 1 2 3 4 5 17 3 3 4 5
1 1 21 1 2 3 4 5 17 3 3 4 5
2 2 36 1 2 3 4 5 17 3 3 4 5
Then
nb_cols = df.shape[1]-2
nb_groups = int(df.columns[-1].split('.')[1])+1
cols_per_group = nb_cols // nb_groups
group_cols = np.arange(nb_cols)//cols_per_group
explode_cols = list(np.arange(nb_groups))
pd.concat([df.loc[:,:'objects'].reset_index(drop=True), \
df.loc[:,'id':].set_axis(group_cols, axis=1).groupby(level=0, axis=1) \
.apply(lambda x: x.values).to_frame().T.explode(explode_cols).reset_index(drop=True) \
.rename(columns = lambda x: 'object_' + str(x)) \
], axis=1)
Result
n objects object_0 object_1
0 0 41 [1, 2, 3, 4, 5] [17, 3, 3, 4, 5]
1 1 21 [1, 2, 3, 4, 5] [17, 3, 3, 4, 5]
2 2 36 [1, 2, 3, 4, 5] [17, 3, 3, 4, 5]
I have a dataframe with columns m, n:
m=[0, 0, 1, 0, 0, 0, 4, 0, 0]
n=[6, 1, 2, 1, 4, 3, 1, 3, 5, 1]
I am looking for an iterative loop that adds up values of column n if the value in column m is non-zero. For example at 3rd place of column m the value is 1 (non-zero) so it should add in the column n from index 0 to 2 i.e. 6+1+2=9. Similarly, at m[6]=4 (non-zero) this implies 1+4+3+1=9 and so on.
Let's say you have a dataframe and you want to sum the elements in each column based on the position of non-zero values in column "m". The following code gives you the output as a dataframe. See the comment in the code if you are just looking for summing the values in column "n":
import pandas as pd
from random import randint
m = [0, 1, 0, 0, 1, 0, 0, 0, 2]
n = [1, 1, 3, 4, 1, 1, 2, 1, 3]
r = [randint(1, 3) for _ in m]
names = ['lev', 'yan', 'coke' , 'coke', 'yan', 'lev', 'lev', 'yan', 'lev']
df = pd.DataFrame({'m': m, 'n': n, 'r': r, 'names': names})
print(f"Input dataframe:\n{df}")
# if you want to iterate over all columns
iter_cols = df.columns.tolist()
iter_cols.remove('m')
# To iterate over an specific column (e.g. 'n') you use iter_cols = ['n']
starting_idx = 0
sum_df = pd.DataFrame()
for idx, val in enumerate(df.m):
if val != 0:
sum_df = sum_df.append(df.iloc[starting_idx: (idx+1)][iter_cols].sum(), ignore_index=True)
starting_idx = idx+1
print(f"Output dataframe:\n{sum_df}")
Output:
Input dataframe:
m n r names
0 0 1 2 lev
1 1 1 3 yan
2 0 3 1 coke
3 0 4 2 coke
4 1 1 2 yan
5 0 1 3 lev
6 0 2 3 lev
7 0 1 3 yan
8 2 3 2 lev
Output dataframe:
n names r
0 2.0 levyan 5.0
1 8.0 cokecokeyan 5.0
2 7.0 levlevyanlev 11.0
And if you want to iterate over distinct values in names column and sum the values in 'n' column accordingly:
iter_cols = ['n']
distinct_names = set(df.names)
print(distinct_names)
out_dct = {}
for name in distinct_names:
starting_idx = 0
sum_df = pd.DataFrame()
for idx, val in enumerate(df.names):
if val == name:
sum_df = sum_df.append(df.iloc[starting_idx: (idx+1)][iter_cols].sum(), ignore_index=True)
starting_idx = idx+1
out_dct[name] = sum_df
I'm trying to find a vectorized solution in pandas that is quite common in spreadsheets which is to cumsum while skipping or setting fixed values on a condition based on the result of the actual cumsum. I have the following:
A
1 0
2 -1
3 2
4 3
5 -2
6 -3
7 1
8 -1
9 1
10 -2
11 1
12 2
13 -1
14 -2
What I need is to add a second column with the cumsum of 'A' and if one of these sums gives a positive value replace it with 0 and continue the cumsum using that 0. At the same time if the cumsum gives a negative value that is lower than the lowest value in column A recorded after a 0 in column B I will need to replace it with that lowest value in column A. I know this is quite a problem but is there a vectorized solution for this? Maybe using an auxiliary column. The result should look like this:
A B
1 0 0
2 -1 -1 # -1+0 = -1
3 2 0 # -1 + 2 = 1 but 1>0 so this is 0
4 3 0 # same as previous row
5 -2 -2 # -2+0 = -2
6 -3 -3 # -2-3 = -5 but the lowest value in column A since last 0 is -3 so this is replaced by -3
7 1 -2 # 1-3 = -2
8 -1 -3 # -1-2 = -3
9 1 -2 # -3 + 1 = -2
10 -2 -3 # -2-2 = -4 but the lowest value in column A since last 0 is -3 so this is replaced by -3
11 1 -2 # -3 +1 = -2
12 2 0 # -2+2 = 0
13 -1 -1 # 0-1 = -1
14 -2 -2 # -1-2 = -3 but the lowest value in column A since last cap is -2 so this is -2 instead of -3
For the moment I made this but does not work 100% and again is not really efficient:
df['B'] = 0
df['B'][0] = 0
for x in range(len(df)-1):
A = df['A'][x + 1]
B = df['B'][x] + A
if B >= 0:
df['B'][x+1] = 0
elif B < 0 and A < 0 and B < A:
df['B'][x+1] = A
else:
df['B'][x + 1] = B
Using df['A'].expanding(1).apply(function) I could run own function which first get only one row, next 2 rows, next 3 rows, etc. I doesn't give result from previous calculation and it needs to make all calculations again and again but it doesn't need global
variables and hardcoded df['A']
Doc: Series.expanding
A = [0, -1, 2, 3, -2, -3, 1, -1, 1, -2, 1, 2, -1, -2]
import pandas as pd
df = pd.DataFrame({"A": A})
def function(values):
#print(values)
#print(type(valuse)
#print(len(values))
result = 0
last_zero = 0
for index, value in enumerate(values):
result += value
if result >= 0:
result = 0
last_zero = index
else:
minimal = min(values[last_zero:])
#print(index, last_zero, minimal)
#if result < minimal:
# result = minimal
result = max(result, minimal)
#print('result:', result)
return result
df['B'] = df['A'].expanding(1).apply(function)
df['B'] = df['B'].astype(int)
print(df)
Result:
A B
0 0 0
1 -1 -1
2 2 0
3 3 0
4 -2 -2
5 -3 -3
6 1 -2
7 -1 -3
8 1 -2
9 -2 -3
10 1 -2
11 2 0
12 -1 -1
13 -2 -2
The same but with normal apply() - it needs global variables and hardcoded df['A']
A = [0, -1, 2, 3, -2, -3, 1, -1, 1, -2, 1, 2, -1, -2]
import pandas as pd
df = pd.DataFrame({"A": A})
result = 0
last_zero = 0
index = 0
def function(value):
global result
global last_zero
global index
result += value
if result >= 0:
result = 0
last_zero = index
else:
minimal = min(df['A'][last_zero:])
#print(index, last_zero, minimal)
#if result < minimal:
# result = minimal
result = max(result, minimal)
index += 1
#print('result:', result)
return result
df['B'] = df['A'].apply(function)
df['B'] = df['B'].astype(int)
print(df)
The same using normal for-loop
A = [0, -1, 2, 3, -2, -3, 1, -1, 1, -2, 1, 2, -1, -2]
import pandas as pd
df = pd.DataFrame({"A": A})
all_values = []
result = 0
last_zero = 0
for index, value in df['A'].iteritems():
result += value
if result >= 0:
result = 0
last_zero = index
else:
minimal = min(df['A'][last_zero:])
#print(index, last_zero, minimal)
#if result < minimal:
# result = minimal
result = max(result, minimal)
all_values.append(result)
df['B'] = all_values
print(df)
Assuming a dataframe like this
In [5]: data = pd.DataFrame([[9,4],[5,4],[1,3],[26,7]])
In [6]: data
Out[6]:
0 1
0 9 4
1 5 4
2 1 3
3 26 7
I want to count how many times the values in a rolling window/slice of 2 on column 0 are greater or equal to the value in col 1 (4).
On the first number 4 at col 1, a slice of 2 on column 0 yields 5 and 1, so the output would be 2 since both numbers are greater than 4, then on the second 4 the next slice values on col 0 would be 1 and 26, so the output would be 1 because only 26 is greater than 4 but not 1. I can't use rolling window since iterating through rolling window values is not implemented.
I need something like a slice of the previous n rows and then I can iterate, compare and count how many times any of the values in that slice are above the current row.
I have done this using list instead of doing it in data frame. Check the code below:
list1, list2 = df['0'].values.tolist(), df['1'].values.tolist()
outList = []
for ix in range(len(list1)):
if ix < len(list1) - 2:
if list2[ix] < list1[ix + 1] and list2[ix] < list1[ix + 2]:
outList.append(2)
elif list2[ix] < list1[ix + 1] or list2[ix] < list1[ix + 2]:
outList.append(1)
else:
outList.append(0)
else:
outList.append(0)
df['2_rows_forward_moving_tag'] = pd.Series(outList)
Output:
0 1 2_rows_forward_moving_tag
0 9 4 1
1 5 4 1
2 1 3 0
3 26 7 0
I have a numeric vector a:
import numpy as np
a = np.random.rand(100)
I wish to get the vector (or any other vector) recoded so that each element is either 0, 1, 2, 3 or 4, according to which a quintile it is in (could be more general for any quantile, like quartile, decile etc.).
This is what I'm doing. There has to be something more elegant, no?
from scipy.stats import percentileofscore
n_quantiles = 5
def get_quantile(i, a, n_quantiles):
if a[i] >= max(a):
return n_quantiles - 1
return int(percentileofscore(a, a[i])/(100/n_quantiles))
a_recoded = np.array([get_quantile(i, a, n_quantiles) for i in range(len(a))])
print(a)
print(a_recoded)
[0.04708996 0.86267278 0.23873192 0.02967989 0.42828385 0.58003015
0.8996666 0.15359369 0.83094778 0.44272398 0.60211289 0.90286434
0.40681163 0.91338397 0.3273745 0.00347029 0.37471307 0.72735901
0.93974808 0.55937197 0.39297097 0.91470761 0.76796271 0.50404401
0.1817242 0.78244809 0.9548256 0.78097562 0.90934337 0.89914752
0.82899983 0.44116683 0.50885813 0.2691431 0.11676798 0.84971927
0.38505195 0.7411976 0.51377242 0.50243197 0.89677377 0.69741088
0.47880953 0.71116534 0.01717348 0.77641096 0.88127268 0.17925502
0.53053573 0.16935597 0.65521692 0.19042794 0.21981197 0.01377195
0.61553814 0.8544525 0.53521604 0.88391848 0.36010949 0.35964882
0.29721931 0.71257335 0.26350287 0.22821314 0.8951419 0.38416004
0.19277649 0.67774468 0.27084229 0.46862229 0.3107887 0.28511048
0.32682302 0.14682896 0.10794566 0.58668243 0.16394183 0.88296862
0.55442047 0.25508233 0.86670299 0.90549872 0.04897676 0.33042884
0.4348465 0.62636481 0.48201213 0.49895892 0.36444648 0.01410316
0.46770595 0.09498391 0.96793139 0.03931124 0.64286295 0.50934846
0.59088907 0.56368594 0.7820928 0.77172038]
[0 4 1 0 2 3 4 0 4 2 3 4 2 4 1 0 1 3 4 2 1 4 3 2 0 3 4 3 4 4 4 2 2 1 0 4 1
3 2 2 4 3 2 3 0 3 4 0 2 0 3 0 1 0 3 4 2 4 1 1 1 3 1 1 4 1 0 3 1 2 1 1 1 0
0 3 0 4 2 1 4 4 0 1 2 3 2 2 1 0 2 0 4 0 3 2 3 2 3 3]
Update: just wanted to say this is so easy in R:
How to get the x which belongs to a quintile?
You could use argpartition. Example:
>>> a = np.random.random(20)
>>> N = len(a)
>>> nq = 5
>>> o = a.argpartition(np.arange(1, nq) * N // nq)
>>> out = np.empty(N, int)
>>> out[o] = np.arange(N) * nq // N
>>> a
array([0.61238649, 0.37168998, 0.4624829 , 0.28554766, 0.00098016,
0.41979328, 0.62275886, 0.4254548 , 0.20380679, 0.762435 ,
0.54054873, 0.68419986, 0.3424479 , 0.54971072, 0.06929464,
0.51059431, 0.68448674, 0.97009023, 0.16780152, 0.17887862])
>>> out
array([3, 1, 2, 1, 0, 2, 3, 2, 1, 4, 3, 4, 1, 3, 0, 2, 4, 4, 0, 0])
Here's one way to do it using pd.cut()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(100))
df.columns = ['values']
# Apply the quantiles
gdf = df.groupby(pd.cut(df.loc[:, 'values'], np.arange(0, 1.2, 0.2)))['values'].apply(lambda x: list(x)).to_frame()
# Make use of the automatic indexing to assign quantile numbers
gdf.reset_index(drop=True, inplace=True)
# Re-expand the grouped list of values. Method provided by #Zero at https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows
gdf['values'].apply(pd.Series).stack().reset_index(level=1, drop=True).to_frame('values').reset_index()