I have two columns in my pandas DataFrame.
   A  B
0  1  5
1  2  3
2  3  2
3  4  0
4  5  1
I need the value in A where the value of B is minimum. In the above case my answer would be 4 since the minimum B value is 0.
Can anyone help me with it?
To find the minimum in column B, you can use df.B.min(). For your DataFrame this returns 0.
To find values at particular locations in a DataFrame, you can use loc:
>>> df.loc[(df.B == df.B.min()), 'A']
3 4
Name: A, dtype: int64
So here, loc picks out all of the rows where column B is equal to its minimum value (df.B == df.B.min()) and selects the corresponding values in column A.
This method returns all values in A corresponding to the minimum value in B. If you only need one of the values, a better way is to use idxmin, as @aus_lacy suggests.
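For reference, a minimal self-contained sketch of this loc approach (reconstructing the example DataFrame above; the variable name is just for illustration):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 3, 2, 0, 1]})

# Boolean mask of rows where B equals its minimum, then select column A.
a_at_min_b = df.loc[df.B == df.B.min(), 'A']   # Series with index 3, value 4
print(a_at_min_b.iloc[0])                      # 4, if you only want a scalar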
Here's one way:
b_min = df.B.idxmin()
a_val = df.A[b_min]
idxmin() returns the index of the minimum value within column B. You then locate the value at that same index in column A.
Or, if you want a single (albeit less readable) line, you can do:
a_val = df.A[df.B.idxmin()]
Also, as a disclaimer, this solution assumes that the minimum value in column B is unique. For example, if you had a data set that looked like this:
A B
1 2
2 5
3 0
4 3
5 0
My solution would return the first instance of B's minimum value, which in this case is in the third row and has a corresponding A value of 3. If the minimum value of B might not be unique, then you should go with @ajcr's solution.
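To make the difference concrete, here is a small sketch on the duplicate-minimum data above (reconstructed from the table; column names as in the question):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 5, 0, 3, 0]})

df.A[df.B.idxmin()]              # 3  -- idxmin only finds the first minimum of B
df.loc[df.B == df.B.min(), 'A']  # returns both matching rows: A = 3 and A = 5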
Related
I have a data set that looks like this:
A  B  C
1  2
   3  4
1     5
1  2  4
1     2
1
I want to return the number of times a value appears in a combination of two columns but not in the third column. The numeric values in this case are arbitrary; I just care about counting instances.
In other words:
1.) How many times does any value appear in column A and column B but not column C?
2.) How many times does any value appear in column B and column C but not column A?
3.) How many times does any value appear in column A and column C but not in column B?
My expected answers based on the mock data I have given above:
1.) 1 (row 1)
2.) 1 (row 2)
3.) 2 (rows 3 and 5)
You could use isna to create a boolean DataFrame. Then filter the rows that have only one NaN value (so that we exclude the last row). Finally sum vertically:
df_na = df.isna()
df_na[df_na.sum(axis=1).eq(1)].sum()
Output:
A 1
B 2
C 1
dtype: int64
So, for example, column A is missing a value while the other two have values once; column B is missing a value while the other two have values twice; and so on.
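For completeness, a self-contained sketch that reproduces this (the NaN placement is inferred from the question's table):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 1, 1, 1, 1],
                   'B': [2, 3, np.nan, 2, np.nan, np.nan],
                   'C': [np.nan, 4, 5, 4, 2, np.nan]})

df_na = df.isna()
# Keep only rows with exactly one missing value, then count the missing values per column.
print(df_na[df_na.sum(axis=1).eq(1)].sum())
# A    1
# B    2
# C    1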
I have three columns. One identifier column and two numeric columns. We'll call them Identifier, A, and B, respectively.
I have found the first quintile value within column A and would like to calculate the average of column B, but only for the identifiers that fall within the first quintile of column A. So, for the identifiers in the first quintile of column A, I want to find the average of column B.
I found the first quintile leveraging:
df.A.quantile([1])
Now I want to find the average of column B for the identifiers that make up the column A first quintile. For example:
Identifier  A  B
A           1  8
B           2  7
C           3  6
D           4  5
E           5  4
F           6  3
G           7  2
H           8  1
The first quintile of column A consists of identifiers G and H and lies at a value of 7.5. I would like to take the average of column B for those identifiers, arriving at 1.5.
Any help in this endeavor would be much appreciated!
Best,
Kilian
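One possible sketch of this calculation (not a definitive answer: it assumes "first quintile" here means the top of the A distribution, and uses the 0.75 quantile as the cutoff so that exactly G and H qualify in this 8-row example):

import pandas as pd

df = pd.DataFrame({'Identifier': list('ABCDEFGH'),
                   'A': [1, 2, 3, 4, 5, 6, 7, 8],
                   'B': [8, 7, 6, 5, 4, 3, 2, 1]})

cutoff = df.A.quantile(0.75)   # 6.25 for this data
top = df[df.A >= cutoff]       # identifiers G and H
print(top.B.mean())            # 1.5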
I am trying to essentially do a COUNTIF in pandas to count how many items in a row match a number in the first column.
Dataframe:
a b c d
1 2 3 1
2 3 4 2
3 5 6 3
So I want to count the instances in a row (b, c, d) that match a. In row 1, for instance, it should be 1, as only d matches a.
I have searched quite a bit for this, but so far I have only found examples where it's a fixed number (like counting all values greater than 0), not based on a DataFrame column. I'm guessing it's some form of logic that masks based on the column, but df == df.a doesn't seem to work.
You can use eq, to which you can pass an axis parameter to specify the direction of the comparison; then do a row sum to count the number of matched values (subtracting 1 so that column a's match with itself isn't counted):
df.eq(df.a, axis=0).sum(1) - 1
#0 1
#1 1
#2 1
#dtype: int64
# Alternatively, a row-wise apply; x.iloc[0] is the value of column a in each row.
df.apply(lambda x: (x == x.iloc[0]).sum() - 1, axis=1)
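A self-contained sketch verifying both snippets against the example data (reconstructed from the question):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 5], 'c': [3, 4, 6], 'd': [1, 2, 3]})

# Compare every column to column a, sum the matches per row,
# and subtract 1 so that a's match with itself is not counted.
print(df.eq(df.a, axis=0).sum(axis=1) - 1)   # 1, 1, 1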
I have a dataset like this:
Participant Type Rating
1 A 6
1 A 5
1 B 4
1 B 3
2 A 9
2 A 8
2 B 7
2 B 6
I want to obtain this:
Type MeanRating
A mean(6,9)
A mean(5,8)
B mean(4,7)
B mean(3,6)
So, for each type, I want the mean of the highest value in each group, then the mean of the second-highest value in each group, etc.
I can't think of a proper way to do this with Python pandas, since the means always seem to apply within groups, but not across them.
First use groupby.rank to create a column that allows you to align the highest values, second highest values, etc. Then perform another groupby using the newly created column to compute the means:
# Get the grouping column.
df['Grouper'] = df.groupby(['Type', 'Participant'])['Rating'].rank(method='first', ascending=False)
# Perform the groupby and format the result.
result = df.groupby(['Type', 'Grouper'])['Rating'].mean().rename('MeanRating')
result = result.reset_index(level=1, drop=True).reset_index()
The resulting output:
Type MeanRating
0 A 7.5
1 A 6.5
2 B 5.5
3 B 4.5
I used the method='first' parameter of groupby.rank to handle the case of duplicate ratings within a ['Type', 'Participant'] group. You can omit it if this is not a possibility within your dataset, but it won't change the output if you leave it and there are no duplicates.
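Putting it together as a runnable sketch (reconstructing the example data from the question):

import pandas as pd

df = pd.DataFrame({'Participant': [1, 1, 1, 1, 2, 2, 2, 2],
                   'Type': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
                   'Rating': [6, 5, 4, 3, 9, 8, 7, 6]})

# Rank ratings within each (Type, Participant) group, highest first.
df['Grouper'] = df.groupby(['Type', 'Participant'])['Rating'].rank(method='first', ascending=False)

result = df.groupby(['Type', 'Grouper'])['Rating'].mean().rename('MeanRating')
print(result.reset_index(level=1, drop=True).reset_index())
#   Type  MeanRating
# 0    A         7.5
# 1    A         6.5
# 2    B         5.5
# 3    B         4.5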
I often need to know how many entries I have in each group in a dataframe in Pandas. The following does it, but it returns one value for every column in my dataframe.
df.groupby(['A', 'B', 'C']).count()
That is, if I have, say 20 columns (where A, B and C are three of them), it would return 17 counts, all identical (at least every time I have done it) within each group.
What is the rationale behind this?
Is there any way to restrict the count to only one column? (or have it return only one value per group?)
Would that speed up the counts in any way?
The DataFrameGroupBy.count method doesn't seem to have an argument to specify which columns to do the count on (I also could not find it in the API reference).
groupby(...).count() returns the count of non-null values in each column, so it can potentially differ from column to column.
example:
>>> df
   jim  joe  jolie
0    4  NaN      4
1    8    0    NaN
>>> df.groupby('jim').count()
     joe  jolie
jim
4      0      1
8      1      0
.groupby(...).size() returns the size of each group.
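So, to get a single value per group (as the question asks), you can either use size() or restrict count() to one column. A brief sketch with the column names from the question, where 'D' stands in for any one of the remaining (hypothetical) columns:

df.groupby(['A', 'B', 'C']).size()          # one value per group: the number of rows
df.groupby(['A', 'B', 'C'])['D'].count()    # one value per group: non-null values in column D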