I have a DataFrame with multiple columns and I would like to add two more: one for the highest number in each row, and another for the second highest. However, instead of the numbers, I would like to show the names of the columns where they are found.
Assume the following data frame:
import pandas as pd
df = pd.DataFrame({'A': [1, 5, 10], 'B': [2, 6, 11], 'C': [3, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
To extract the highest number in every row, I can just apply max(axis=1) like this:
df['max1'] = df[['A', 'B', 'C', 'D', 'E']].max(axis=1)
This gets me the max number, but not the column name itself.
How can this be applied to the second max number as well?
You can sort the values and assign the top-2 values:
import numpy as np

cols = ['A', 'B', 'C', 'D', 'E']
df[['max2','max1']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:]
print (df)
A B C D E max2 max1
0 1 2 3 4 5 4 5
1 5 6 7 8 9 8 9
2 10 11 12 13 14 13 14
If you prefer to assign the values in max1, max2 column order instead, reverse the sorted slice:
df[['max1','max2']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:][:, ::-1]
EDIT: To get the top-2 column names and the top-2 values, use:
df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
cols = ['A', 'B', 'C', 'D', 'E']
#values as a numpy array
vals = df[cols].to_numpy()
#column names as a numpy array
cols = np.array(cols)
#get indices that would sort each row in descending order
arr = np.argsort(-vals, axis=1)
#top 2 column names
df[['top1','top2']] = cols[arr[:, :2]]
#top 2 values (largest first, so assign max1 then max2)
df[['max1','max2']] = vals[np.arange(arr.shape[0])[:, None], arr[:, :2]]
print (df)
A B C D E top1 top2 max1 max2
0 1 2 3 40 5 D E 40 5
1 50 6 7 8 9 A E 50 9
2 10 11 12 13 14 E D 14 13
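For small frames, a slower but arguably more readable row-wise sketch using Series.nlargest gives the same information; the helper top2, the out name and the apply(axis=1) usage are illustrative, not part of the answer above:
def top2(row):
    # two largest values of the row, with their column labels as the index
    s = row.nlargest(2)
    return pd.Series({'top1': s.index[0], 'top2': s.index[1],
                      'max1': s.iloc[0], 'max2': s.iloc[1]})
out = df[['A', 'B', 'C', 'D', 'E']].apply(top2, axis=1)
print (out)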
Another approach: you can get the first max, then remove it and take the max again to get the second max.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 15, 10], 'B': [2, 89, 11], 'C': [80, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
# first max and the column where it occurs
max1 = df.max(axis=1)
maxcolum1 = df.idxmax(axis=1)
# replace the row maxima with 0, then take max/idxmax again for the second max
max2 = df.replace(np.array(df.max(axis=1)), 0).max(axis=1)
maxcolum2 = df.replace(np.array(df.max(axis=1)), 0).idxmax(axis=1)
df2 = pd.DataFrame({'max1': max1, 'max2': max2, 'maxcol1': maxcolum1, 'maxcol2': maxcolum2})
df.join(df2)
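A variant sketch of the same idea that masks the per-row maximum with NaN instead of replacing it with 0 (this assumes the row maximum occurs only once per row; the masked name is illustrative):
# mark each row's maximum and hide it, then take max/idxmax again
masked = df.mask(df.eq(df.max(axis=1), axis=0))
max2 = masked.max(axis=1)
maxcolum2 = masked.idxmax(axis=1)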
Consider the example below
dfa = pd.DataFrame({'type' : ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q'],
'value' : [2,3,4,2,5,3,6,5,3,1,3,5,7,5,3,5,4],
'date' : [pd.to_datetime('2021-01-01')]*17})
dfa
Out[337]:
type value date
0 a 2 2021-01-01
1 b 3 2021-01-01
2 c 4 2021-01-01
3 d 2 2021-01-01
4 e 5 2021-01-01
5 f 3 2021-01-01
6 g 6 2021-01-01
7 h 5 2021-01-01
8 i 3 2021-01-01
9 j 1 2021-01-01
10 k 3 2021-01-01
11 l 5 2021-01-01
12 m 7 2021-01-01
13 n 5 2021-01-01
14 o 3 2021-01-01
15 p 5 2021-01-01
16 q 4 2021-01-01
As you can see, I have (too) many categories, but I still need to plot all of them at the same time. I tried to use the hatch argument in matplotlib, but it applies the same pattern to every bar rather than shading each category with its own pattern (so that the many categories become visually distinct).
dfa.set_index(['date','type']).unstack().plot.bar(stacked = True, hatch = 'o')
What can I do here?
Thanks!
You could loop through the generated bars, and assign a unique hatch pattern to each individual group. You'll need to generate the legend again, so it gets updated with the changed bars.
Choosing bar.set_hatch(pattern * 2) instead of just bar.set_hatch(pattern) will generate a pattern that is twice as dense. See the hatch demo for more examples.
import matplotlib.pyplot as plt
import pandas as pd
dfa = pd.DataFrame({'type': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q'] * 2,
'value': [2, 3, 4, 2, 5, 3, 6, 5, 3, 1, 3, 5, 7, 5, 3, 5, 4,
3, 2, 2, 1, 5, 2, 7, 2, 3, 7, 5, 3, 5, 3, 5, 4, 3],
'date': [pd.to_datetime('2021-01-01')] * 17 + [pd.to_datetime('2021-01-02')] * 17})
ax = dfa.set_index(['date', 'type']).unstack().plot.bar(stacked=True, rot=0)
hatch_patterns = ['/', '\\', '|', '-', '+', 'x', 'o', 'O', '.', '*', '/o', '\\|', '|*', '-\\', '+o', 'x*', 'o-', 'O|']
for bars, pattern in zip(ax.containers, hatch_patterns):
for bar in bars:
bar.set_hatch(pattern * 2)
ax.legend(bbox_to_anchor=(1.01, 1.01), loc='upper left')
plt.tight_layout()
plt.show()
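If the patterns are hard to see on light bar colors, note that patch hatching is drawn in the patch's edge color, so forcing a darker edge can help; a small sketch to add before plt.show():
for bars in ax.containers:
    for bar in bars:
        # hatch lines follow the edge color of each bar
        bar.set_edgecolor('black')
        bar.set_linewidth(0.5)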
I have a dataframe that looks like this:
A B C D E
0 P 10 NaN 5.0 9.0
1 Q 19 NaN NaN 4.0
2 R 8 NaN 3.0 7.0
3 S 20 NaN 3.0 7.0
4 T 4 NaN 2.0 NaN
And I have a list: [['A', 'B', 'D', 'E'], ['A', 'B', 'D'], ['A', 'B', 'E']]
I am iterating over the list and want to get only those rows from the dataframe for which the columns specified by each sub-list are not empty.
I have tried with the following code:
test_df = pd.DataFrame([['P', 10, np.nan, 5, 9], ['Q', 19, np.nan, np.nan, 4], ['R', 8, np.nan, 3, 7],
['S', 20, np.nan, 3, 7], ['T', 4, np.nan, 2, np.nan]], columns=list('ABCDE'))
priority_list = [list('ABDE'), list('ABD'), list('ABE')]
for elem in priority_list:
test_df = test_df.loc[test_df[elem].notna()]
print(test_df)
But this is throwing the following error:
File "C:\Python37\lib\site-packages\pandas\core\indexing.py", line 879, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "C:\Python37\lib\site-packages\pandas\core\indexing.py", line 1097, in _getitem_axis
raise ValueError("Cannot index with multidimensional key")
ValueError: Cannot index with multidimensional key
How to overcome this issue and check for multiple columns for non-na values in the dataframe?
Use DataFrame.all to test whether all selected values are True:
priority_list = [list('ABDE'), list('ABD'), list('ABE')]
for elem in priority_list:
test_df = test_df.loc[test_df[elem].notna().all(axis=1)]
print(test_df)
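Note that test_df is reassigned inside the loop, so each element of priority_list narrows the result further. If you instead want one independently filtered frame per column set, a small sketch to run in place of the loop (the filtered name is illustrative):
# one filtered frame per column set, all taken from the same starting frame
filtered = {tuple(elem): test_df.loc[test_df[elem].notna().all(axis=1)]
            for elem in priority_list}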
I have two pandas dataframes
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})
The index of both dataframes is 'A'.
How can I replace the values of df1's column 'B' with the values of df2's column 'B'?
RESULT of df1:
A B
1 8
3 10
5 12
Maybe DataFrame.isin() is what you're looking for:
df1['B'] = df2[df2['A'].isin(df1['A'])]['B'].values
print(df1)
Prints:
A B
0 1 8
1 3 10
2 5 12
One possible solution:
wrk = df1.set_index('A').B
wrk.update(df2.set_index('A').B)
df1 = wrk.reset_index()
The result is:
A B
0 1 8
1 3 10
2 5 12
Another solution, based on merge:
df1 = df1.merge(df2[['A', 'B']], how='left', on='A', suffixes=['_x', ''])\
.drop(columns=['B_x'])
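If 'A' is a regular column rather than the index, a map-based sketch also works, assuming the 'A' values in df2 are unique:
# look up each value of df1['A'] in df2's A -> B mapping
df1['B'] = df1['A'].map(df2.set_index('A')['B'])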
I need to shift a grouped data frame by a dynamic number. I can do it with apply, but the performance is not very good.
Any way to do that without apply?
Here is a sample of what I would like to do:
df = pd.DataFrame({
'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'B','B','B','B','B','B'],
'VALUE': [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2],
'SHIFT': [ 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
})
df['SUM'] = df.groupby('GROUP').VALUE.cumsum()
# THIS DOESN'T WORK:
df['VALUE'] = df.groupby('GROUP').SUM.shift(df.SHIFT)
I do it with apply in the following way:
df = pd.DataFrame({
'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'B','B','B','B','B','B'],
'VALUE': [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2],
'SHIFT': [ 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
})
def func(group):
s = group.SHIFT.iloc[0]
group['SUM'] = group.SUM.shift(s)
return group
df['SUM'] = df.groupby('GROUP').VALUE.cumsum()
df = df.groupby('GROUP').apply(func)
Here is a pure numpy version that works if the data frame is sorted by group (like your example):
# these rows are not null after shifting
notnull = np.where(df.groupby('GROUP').cumcount() >= df['SHIFT'])[0]
# source rows for rows above
source = notnull - df['SHIFT'].values[notnull]
shifted = np.empty(df.shape[0])
shifted[:] = np.nan
shifted[notnull] = df.groupby('GROUP')['VALUE'].cumsum().values[source]
df['SUM'] = shifted
It first gets the indices of the rows that are to be updated; subtracting the shifts from those indices yields the source rows.
A solution that avoids apply could be the following, provided the groups are contiguous:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'B','B','B','B','B','B'],
'VALUE': [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2],
'SHIFT': [ 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
})
# compute values required for the slices
_, start = np.unique(df.GROUP.values, return_index=True)
gp = df.groupby('GROUP')
shifts = gp.SHIFT.first()
sizes = gp.size().values
end = (sizes - shifts.values) + start
# compute slices
source = [i for s, f in zip(start, end) for i in range(s, f)]
target = [i for j, s, f in zip(start, shifts, sizes) for i in range(j + s, j + f)]
# compute cumulative sum and arrays of nan
s = gp.VALUE.cumsum().values
r = np.empty_like(s, dtype=np.float32)
r[:] = np.nan
# put the shifted cumulative sums onto the array of nan
np.put(r, target, s[source])
# set the sum column
df['SUM'] = r
print(df)
Output
GROUP SHIFT VALUE SUM
0 A 2 1 NaN
1 A 2 2 NaN
2 A 2 3 1.0
3 A 2 4 3.0
4 A 2 5 6.0
5 A 2 6 10.0
6 B 3 7 NaN
7 B 3 8 NaN
8 B 3 9 NaN
9 B 3 0 7.0
10 B 3 1 15.0
11 B 3 2 24.0
With the exception of building the slices (source and target), all computations are done at the pandas/numpy level, which should be fast. The idea is to manually simulate what would be done in the apply function.
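As an optional sanity check, a sketch that recomputes SUM with the apply-based approach from the question and compares the two results (assuming SHIFT is constant within each group; the check name is illustrative):
check = (df.groupby('GROUP', group_keys=False)
           .apply(lambda g: g['VALUE'].cumsum().shift(g['SHIFT'].iloc[0])))
# NaNs in the same positions compare as equal; this is expected to print True
print(np.allclose(check.to_numpy(), df['SUM'].to_numpy(), equal_nan=True))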