I'm trying to create top-N columns, i.e. the largest few values taken across a set of columns in each row. Pandas has a method nlargest, but I cannot get it to work on rows. Pandas also has max and idxmax, which do exactly what I want, but only for the single largest value.
df = pd.DataFrame(np.array([[1, 2, 3, 5, 1, 9], [4, 5, 6, 2, 5, 9], [7, 8, 9, 2, 5, 10]]), columns=['a', 'b', 'c', 'd', 'e', 'f'])
cols = df.columns[:-1].tolist()
df['max_1_val'] = df[cols].max(axis=1)
df['max_1_col'] = df[cols].idxmax(axis=1)
Output:
a b c d e f max_1_val max_1_col
0 1 2 3 5 1 9 5 d
1 4 5 6 2 5 9 6 c
2 7 8 9 2 5 10 9 c
But I am trying to get max_n_val and max_n_col so the expected output for top 3 would be:
a b c d e f max_1_val max_1_col max_2_val max_2_col max_3_val max_3_col
0 1 2 3 5 1 9 5 d 3 c 2 b
1 4 5 6 2 5 9 6 c 5 b 5 e
2 7 8 9 2 5 10 9 c 8 b 7 a
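For reference, Series.nlargest can be made to work row-wise with apply; a minimal sketch (simple, but slow for large frames because it runs a Python function per row):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, 5, 1, 9], [4, 5, 6, 2, 5, 9], [7, 8, 9, 2, 5, 10]]),
                  columns=['a', 'b', 'c', 'd', 'e', 'f'])
cols = df.columns[:-1]
N = 3

def top_n(row, n=N):
    # nlargest on a row (a Series) returns the n largest values, labelled by column name
    top = row.nlargest(n)
    out = {}
    for i, (col, val) in enumerate(top.items(), start=1):
        out[f'max_{i}_val'] = val
        out[f'max_{i}_col'] = col
    return pd.Series(out)

result = df.join(df[cols].apply(top_n, axis=1))
print(result)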
For better performance, numpy.argsort is used to get the positions; for the correct (descending) order, the last 3 items are taken, reversed by indexing:
N = 3
a = df[cols].to_numpy().argsort()[:, :-N-1:-1]
print (a)
[[3 2 1]
[2 4 1]
[2 1 0]]
Then get the column names into c by indexing the cols array with a, and reorder the values into d with fancy indexing along the rows:
c = np.array(cols)[a]
d = df[cols].to_numpy()[np.arange(a.shape[0])[:, None], a]
Finally, create DataFrames from both arrays, join them with concat, and reorder the column names with DataFrame.reindex:
df1 = pd.DataFrame(c).rename(columns=lambda x : f'max_{x+1}_col')
df2 = pd.DataFrame(d).rename(columns=lambda x : f'max_{x+1}_val')
c = df.columns.tolist() + [y for x in zip(df2.columns, df1.columns) for y in x]
df = pd.concat([df, df1, df2], axis=1).reindex(c, axis=1)
print (df)
a b c d e f max_1_val max_1_col max_2_val max_2_col max_3_val max_3_col
0 1 2 3 5 1 9 5 d 3 c 2 b
1 4 5 6 2 5 9 6 c 5 e 5 b
2 7 8 9 2 5 10 9 c 8 b 7 a
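A possible variant of the first step (an assumption, not part of the original solution): for frames with many columns, numpy.argpartition finds the top-N positions without fully sorting each row, and only those N values are sorted afterwards; the rest of the solution is unchanged:
N = 3
vals = df[cols].to_numpy()
# positions of the N largest values per row, in no particular order
part = np.argpartition(vals, -N, axis=1)[:, -N:]
# sort just those N values per row, then reverse for descending order
order = np.argsort(np.take_along_axis(vals, part, axis=1), axis=1)[:, ::-1]
a = np.take_along_axis(part, order, axis=1)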
I have the following df:
A B C
1 10 2
1 15 0
2 5 2
2 5 0
I add column D through:
df["D"] = (df.B - df.C).cumsum()
A B C D
1 10 2 8
1 15 0 23
2 5 2 26
2 5 0 31
I want the cumsum to restart at row 3, where the value in column A differs from the value in the previous row.
Desired output:
A B C D
1 10 2 8
1 15 0 23
2 5 2 3
2 5 0 8
Try with
df['new'] = (df.B-df.C).groupby(df.A).cumsum()
Out[343]:
0 8
1 23
2 3
3 8
dtype: int64
Use groupby and cumsum
df['D'] = df.assign(D=df['B']-df['C']).groupby('A')['D'].cumsum()
A B C D
0 1 10 2 8
1 1 15 0 23
2 2 5 2 3
3 2 5 0 8
import pandas as pd
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [10, 15, 5, 5], "C": [2, 0, 2, 0]})
df['D'] = df['B'] - df['C']
df = df.groupby('A').cumsum()
print(df)
output:
B C D
0 10 2 8
1 25 2 23
2 5 2 3
3 10 2 8
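Note that df.groupby('A').cumsum() accumulates every remaining column, so B and C in this output are cumulative sums too, and A is dropped. A minimal sketch that keeps the original columns and only accumulates the new D column, along the lines of the answers above:
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [10, 15, 5, 5], "C": [2, 0, 2, 0]})
# accumulate only the B - C difference within each group of A, keeping A, B and C intact
df["D"] = (df["B"] - df["C"]).groupby(df["A"]).cumsum()
print(df)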
I have two dataframes, which for simplicity look like:
A B C D E
1 2 3 4 5
5 4 3 2 1
1 3 5 7 9
9 7 5 3 1
And the second one looks like:
F
0
1
0
1
So, both dataframes have the SAME number of rows.
I want to attach column F to the first dataframe:
A B C D E F
1 2 3 4 5 0
5 4 3 2 1 1
1 3 5 7 9 0
9 7 5 3 1 1
I have already tried various methods such as joins, iloc, and adding df['F'] manually, but I can't seem to find an answer. Most of the time F gets added to the dataframe but filled with NaN: in the rows where the first dataframe already has data I get NaN in F, and then I get double the number of rows with NaN everywhere except F, where the data is OK.
It seems you want to add column F to the first dataframe regardless of the indexes of the two dataframes. In that case, just assign the underlying ndarray of column F:
df1['F'] = df2['F'].to_numpy()
Out[131]:
A B C D E F
0 1 2 3 4 5 0
1 5 4 3 2 1 1
2 1 3 5 7 9 0
3 9 7 5 3 1 1
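Another option that also ignores the original indexes is to reset both and concatenate side by side (a sketch, assuming the rows of df1 and df2 correspond by position):
df_combined = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)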
You just have to create a new column on the original dataframe, assigning it the column from the second dataframe:
Generating the example:
import pandas as pd
data1 = {"A": [1, 5, 1, 9],
"B": [2, 4, 3, 7],
"C": [3, 3, 5, 5],
"D": [4, 2, 7, 3],
"E": [5, 1, 9, 1]}
data2 = {"F": [0, 1, 0, 1]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
#creating the column
df1["F"] = df2.F
df1
> A B C D E F
> 0 1 2 3 4 5 0
> 1 5 4 3 2 1 1
> 2 1 3 5 7 9 0
> 3 9 7 5 3 1 1
My dataframes:
df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [7, 8, 8]]),columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [5, 8, 8]]),columns=['a', 'b', 'c'])
df1,df2:
a b c
0 1 2 3
1 4 2 3
2 7 8 8
a b c
0 1 2 3
1 4 2 3
2 5 8 8
I want to combine the values of column a from both dataframes in all combinations, but only for rows where the values in columns b and c are equal.
Right now I only have a solution for all combinations in general, with this code:
x = np.array(np.meshgrid(df1.a.values,
df2.a.values)).T.reshape(-1,2)
df = pd.DataFrame(x)
print(df)
0 1
0 1 1
1 1 4
2 1 5
3 4 1
4 4 4
5 4 5
6 7 1
7 7 4
8 7 5
expected output for df1.a and df2.a only for rows where df1.b==df2.b and df1.c==df2.c:
0 1
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
So basically I need to match rows on the selected columns b and c.
You should try DataFrame.merge using an inner merge (the default):
df1.merge(df2, on=['b', 'c'])[['a_x', 'a_y']]
a_x a_y
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
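If the result should carry the 0/1 column labels from the expected output, the merge suffixes can be set explicitly and renamed afterwards (a small follow-up sketch; the suffix names are arbitrary):
out = (df1.merge(df2, on=['b', 'c'], suffixes=('_1', '_2'))[['a_1', 'a_2']]
          .rename(columns={'a_1': 0, 'a_2': 1}))
print(out)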
I have a dataframe that looks like this
A B C D G
0 9 5 7 6 1
1 1 4 7 3 1
2 8 4 1 3 1
generated by this:
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
x=np.array([[1,2]])
df['G'] = np.repeat(x,5)
Suppose that a certain column 'E' sometimes exists and sometimes doesn't, depending on the time frame of the data.
So sometimes we have
A B C D E G
0 9 5 7 6 2 1
1 1 4 7 3 3 1
2 8 4 1 3 4 1
So either way, I'd like to sum the rows from columns A, C, and E, grouped by column G. When column E exists, I just use
df.groupby('G')[['A', 'C', 'E']].sum()
but when E doesn't exist, like in the first dataframe, it doesn't work.
What do I need to do in order to sum even if a column is missing?
You could store the columns you wish to sum in a list, sum_cols = list('ACE'), and then intersect this list with the columns of whatever DataFrame you're working with.
df.groupby('G')[df.columns.intersection(sum_cols)].sum()
Demo
>>> df = pd.DataFrame(np.random.randint(0, 10, (2, 5)),
columns=list('ABCDG'))
>>> df
A B C D G
0 9 5 9 2 6
1 3 1 1 1 3
>>> sum_cols = list('ACE')
>>> df.groupby('G')[df.columns.intersection(sum_cols)].sum()
A C
G
3 3 1
6 9 9
>>> df['E'] = [100, 200]
>>> df.groupby('G')[df.columns.intersection(sum_cols)].sum()
A C E
G
3 3 1 200
6 9 9 100
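An equivalent sketch (not from the answer above) filters the wanted columns with a plain list comprehension, which keeps the order of sum_cols rather than the frame's column order; it reuses df and sum_cols from the demo:
present = [c for c in sum_cols if c in df.columns]
df.groupby('G')[present].sum()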
Given a dataframe:
>>> import pandas as pd
>>> lol = [['a', 1, 1], ['b', 1, 2], ['c', 1, 4], ['c', 2, 9], ['b', 2, 10], ['x', 2, 5], ['d', 2, 3], ['e', 3, 5], ['d', 2, 10], ['a', 3, 5]]
>>> df = pd.DataFrame(lol)
>>> df = df.rename(columns={0:'value', 1:'key', 2:'something'})
>>> df
value key something
0 a 1 1
1 b 1 2
2 c 1 4
3 c 2 9
4 b 2 10
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
The goal is to keep the last N rows for the unique values of the key column.
If N=1, I could simply use the .drop_duplicates() function as such:
>>> df.drop_duplicates(subset='key', keep='last')
value key something
2 c 1 4
8 d 2 10
9 a 3 5
How do I keep the last 3 rows for each unique values of key?
I could try this for N=3:
>>> from itertools import chain
>>> unique_keys = {k:[] for k in df['key']}
>>> for idx, row in df.iterrows():
... k = row['key']
... unique_keys[k].append(list(row))
...
>>>
>>> df = pd.DataFrame(list(chain(*[v[-3:] for k,v in unique_keys.items()])))
>>> df.rename(columns={0:'value', 1:'key', 2:'something'})
value key something
0 a 1 1
1 b 1 2
2 c 1 4
3 x 2 5
4 d 2 3
5 d 2 10
6 e 3 5
7 a 3 5
But there must be a better way...
Is this what you want?
df.groupby('key').tail(3)
Out[127]:
value key something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
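GroupBy.tail keeps the original row order. If the result should instead be ordered by key, as in the manual attempt above, a stable sort afterwards preserves the within-group order (a small follow-up sketch using the same df):
df.groupby('key').tail(3).sort_values('key', kind='stable')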
Does this help:
for k, v in df.groupby('key'):
    print(v[-2:])
value key something
1 b 1 2
2 c 1 4
value key something
6 d 2 3
8 d 2 10
value key something
7 e 3 5
9 a 3 5
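If a single DataFrame is wanted instead of printed chunks, the per-group tails from that loop can be collected with concat (a sketch, using tail(3) to match the question's N=3):
result = pd.concat([v.tail(3) for _, v in df.groupby('key')])
print(result)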