Adding certain columns from dataframe - python

I have the following dataframe df:
A B C D E
J 4 2 3 2 3
K 5 2 6 2 1
L 2 6 5 4 7
I would like to create an additional column that sums each row of the df, excluding column A (which also contains numbers). What I have tried is:
df['summation'] = df.iloc[:, 1:4].sum(axis=0)
However, the summation column is added but contains only NaN values.
Desired output is:
A B C D E summation
J 4 2 3 2 3 10
K 5 2 6 2 1 11
L 2 6 5 4 7 22
I want the sum along each row, from column B to the end.

As pointed out in the comments, you applied sum on the wrong axis. If you want to exclude columns from the sum, you can use drop (which also accepts a list of column names; that can be handy if you want to exclude columns at, e.g., index 0 and 3, where iloc might not be ideal):
df.drop('A', axis=1).sum(axis=1)
which yields
J 10
K 11
L 22
@ayhan's solution in the comments also works fine.
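For completeness, a minimal runnable sketch of the fix, reconstructing the question's frame (the iloc line is my guess at the positional variant suggested in the comments):

import pandas as pd

df = pd.DataFrame(
    {'A': [4, 5, 2], 'B': [2, 2, 6], 'C': [3, 6, 5], 'D': [2, 2, 4], 'E': [3, 1, 7]},
    index=['J', 'K', 'L'],
)

# Sum across each row (axis=1), leaving column A out of the sum.
df['summation'] = df.drop('A', axis=1).sum(axis=1)

# Positional alternative: every column from B onwards.
# df['summation'] = df.iloc[:, 1:].sum(axis=1)

print(df)
#    A  B  C  D  E  summation
# J  4  2  3  2  3         10
# K  5  2  6  2  1         11
# L  2  6  5  4  7         22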

groupby a column to get a minimum value and corresponding value in another column [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
I had a similar situation, but with a more complex column heading (e.g. "B val"), in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can easily be expanded to select the n rows with the smallest values in a specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with the other answers, .reset_index(drop=True) is needed to exactly match the result desired in the question, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
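A self-contained sketch of that pipeline on the question's frame (pd.DataFrame.head assumes the usual import pandas as pd; the GroupBy.head(1) shorthand in the final comment is my own suggestion, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})

# Smallest B first across the whole frame, then keep the first row per group.
out = (df.sort_values('B')
         .groupby('A')
         .apply(pd.DataFrame.head, n=1)
         .reset_index(drop=True))
print(out)
#    A  B   C
# 0  1  2  10
# 1  2  4   4

# df.sort_values('B').groupby('A').head(1) should give the same rows more directly.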
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First, we get the min values as a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then we merge this Series back onto the original DataFrame:
data = data.merge(min_value, on='A', suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we keep only the rows where B equals B_min, and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution is, as written before:
df.loc[df.groupby('A')['B'].idxmin()]
If you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B, so I added dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B equals the group minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4

Is there a way to filter out rows from a table with an unnamed column

I'm currently trying to analyze rolling correlations of a dataset with four compared values, but I only need the output rows containing 'a'.
I got my data frame by using the command newdf = df.rolling(3).corr()
Sample input (random numbers)
a b c d
1 a
1 b
1 c
1 d
2 a
2 b
2 c
2 d
3 a
3 b 5 6 3
3 c 4 3 1
3 d 3 4 2
4 a 1 3 5 6
4 b 6 2 4 1
4 c 8 6 6 7
4 d 2 5 4 6
5 a 2 5 4 1
5 b 1 4 6 3
5 c 2 6 3 7
5 d 3 6 3 7
and need the output
a b c d
1 a 1 3 5 6
2 a 2 5 4 1
I've tried filtering it by doing adf = newdf.filter(['a'], axis=0); however, that gets rid of everything, and when doing it for the other axis it filters by column. Unfortunately, the column containing the rows with values a, b, c, d is unnamed, so I can't filter that column individually. This wouldn't be an issue, however, if it's possible to flip the rows and columns so the values are listed by index, to get the desired output.
Try using loc. Put the column of repeating a/b/c/d values in as the index and just use loc:
df.loc['a']
The actual source of the problem in your case is that your DataFrame
has a MultiIndex.
So when you attempt to execute newdf.filter(['a'], axis=0) you want
to leave rows with the index containing only "a" string.
But since your DataFrame has a MultiIndex, each row with "a" at
level 1 contains also some number at level 0.
To get your intended result, run:
newdf.filter(like='a', axis=0)
maybe followed by .dropna().
An alternative solution is:
newdf.xs('a', level=1, drop_level=False)
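A small self-contained illustration of both options; the MultiIndex frame below is a toy stand-in for the rolling-correlation output, not the asker's actual data:

import numpy as np
import pandas as pd

# Toy frame with a (window, variable) MultiIndex, mimicking df.rolling(3).corr() output.
idx = pd.MultiIndex.from_product([[1, 2, 3], list('abcd')])
newdf = pd.DataFrame(np.arange(48).reshape(12, 4), index=idx, columns=list('abcd'))

# Keep rows whose index label contains the string 'a' (matches level 1 here).
print(newdf.filter(like='a', axis=0))

# Or select level 1 == 'a' explicitly, keeping that level in the result.
print(newdf.xs('a', level=1, drop_level=False))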

How to remove duplicates based on two columns, removing the one with the largest value in a 3rd column, in a pandas DataFrame?

Suppose I have a pandas dataframe that is like this:
df =
A B C D
A B 6 2
A C 4 2
D F 9 3
K L 8 9
A B 4 3
D F 8 2
How can I say: if columns A and B have duplicates, remove the rows that have the largest value in column C?
So for instance we can see lines 1 and 5 have the same columns A and B.
A B 6 2 (Line 1)
A B 4 3 (Line 5)
I want to remove line 1 as 6 is greater than 4.
So my output should be
A C 4 2
K L 8 9
A B 4 3
D F 8 2
Try sorting on column C in ascending order (so the row with the smaller value comes first) using DataFrame.sort_values, then drop duplicates on columns A and B using DataFrame.drop_duplicates, which keeps the first occurrence by default:
df.sort_values(by=['C'],ascending=[True],inplace=True)
df.drop_duplicates(subset=['A','B'],inplace=True)
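A minimal runnable sketch of that approach; the column names A-D are my assumption, since the question's frame is shown without a header:

import pandas as pd

df = pd.DataFrame({'A': ['A', 'A', 'D', 'K', 'A', 'D'],
                   'B': ['B', 'C', 'F', 'L', 'B', 'F'],
                   'C': [6, 4, 9, 8, 4, 8],
                   'D': [2, 2, 3, 9, 3, 2]})

# Smallest C first, so drop_duplicates (keep='first' by default) keeps the
# row with the smaller C for each duplicated (A, B) pair.
df.sort_values(by=['C'], ascending=True, inplace=True)
df.drop_duplicates(subset=['A', 'B'], inplace=True)
print(df.sort_index())
#    A  B  C  D
# 1  A  C  4  2
# 3  K  L  8  9
# 4  A  B  4  3
# 5  D  F  8  2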

Pandas: remove old DataFrame from memory after groupby

value Group something
0 a 1 1
1 b 1 2
2 c 1 4
3 c 2 9
4 b 2 10
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
I want to select the last 3 rows of each group (from the above df), like the following, but perform the operation in place. I want to ensure that I am keeping only the new df object in memory after the assignment. What would be an efficient way of doing it?
df = df.groupby('Group').tail(3)
The result should look like the following:
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
N.B:- This question is related to Keeping the last N duplicates in pandas
df = df.groupby('Group').tail(3) is already an efficient way of doing it. Because you are overwriting the df variable, Python will take care of releasing the memory of the old dataframe, and you will only have access to the new one.
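If you want to be explicit about it, a hedged sketch using the question's data; this is optional, since rebinding df already drops the last reference to the old frame:

import gc
import pandas as pd

df = pd.DataFrame({'value': list('abccbxdeda'),
                   'Group': [1, 1, 1, 2, 2, 2, 2, 3, 2, 3],
                   'something': [1, 2, 4, 9, 10, 5, 3, 5, 10, 5]})

new_df = df.groupby('Group').tail(3)  # last 3 rows of each group
del df          # remove the only name still bound to the original frame
gc.collect()    # optional: collect anything left behind in reference cycles
df = new_df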
Trying way too hard to guess what you want.
NOTE: using Pandas inplace argument where it is available is NO guarantee that a new DataFrame won't be created in memory. In fact, it may very well create a new DataFrame in memory and replace the old one behind the scenes.
from collections import defaultdict

def f(s):
    c = defaultdict(int)
    # Walk the series from the end, counting occurrences of each group value;
    # yield the index labels of everything beyond the last 3 per group.
    for i, x in zip(s.index[::-1], s.values[::-1]):
        c[x] += 1
        if c[x] > 3:
            yield i

df.drop([*f(df.Group)], inplace=True)
df
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
Your answer is already in the post. However, as said earlier in the comments, you are overwriting the existing df; to avoid that, assign the result to a new column name like below:
df['new_col'] = df.groupby('Group').tail(3)
However, out of curiosity, if you are not concerned about the groupby and are only looking for the last N rows of the df, you can do it like below:
df[-2:] # last 2 rows

Pandas: Get rows whose values are among the highest n in multiple columns

Suppose I have a pandas DataFrame like this. The red values in columns C and E (highlighted in the original screenshot) are the highest 10 numbers in each column, respectively.
How can I get a DataFrame that returns only the rows which are in the highest 10 of both columns? If a value is in the highest 10 of one column but not both, the row should be ignored.
At the moment I do this with looping: I loop through each column separately, and if the value is in the highest 10 I save the row index; then I loop a third time to exclude indexes which are not in both. This is very inefficient since I work with a table of over 100,000 rows. Is there a better way to do it?
Consider the example dataframe df
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
rng = np.arange(10)
df = pd.DataFrame(
    dict(
        A=rng,
        B=list('abcdefghij'),
        C=np.random.permutation(rng),
        D=np.random.permutation(rng)
    )
)
print(df)
A B C D
0 0 a 9 1
1 1 b 4 3
2 2 c 5 5
3 3 d 1 9
4 4 e 7 4
5 5 f 6 6
6 6 g 8 0
7 7 h 3 2
8 8 i 2 7
9 9 j 0 8
Use nlargest to identify the largest values in each column, then use query to filter the DataFrame:
n = 5
c_lrgst = df.C.nlargest(n)
d_lrgst = df.D.nlargest(n)
df.query('C in @c_lrgst & D in @d_lrgst')
A B C D
2 2 c 5 5
5 5 f 6 6
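A possible follow-up (my own variant, not from the original answer): matching on index labels rather than on values avoids picking up rows that merely share a value with a top-n row when C or D contains duplicates. Continuing with the df and n defined above:

# Intersect the index labels of the top-n rows of each column.
top_idx = df.C.nlargest(n).index.intersection(df.D.nlargest(n).index)
print(df.loc[top_idx].sort_index())
#    A  B  C  D
# 2  2  c  5  5
# 5  5  f  6  6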
