I have four columns in a pandas DataFrame: A, B, C, and D. Column A contains 30 words, 18 of which also appear in column B. Column C contains either a 1 or a 2 (the keyboard response to the column B words), and column D likewise contains a 1 or a 2 (the correct response).
What I need is the total number of correct responses for only the words where columns A and B overlap. I understand how to compare the C and D columns to get the total correct once I have the right subset, but I am having a hard time wrapping my head around comparing the overlap in A and B.
Use Series.isin():
df.B.isin(df.A)
That will give you a boolean Series the same length as df.B indicating for each value in df.B whether it is also present in df.A.
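For example, a minimal sketch of the full calculation described in the question (assuming the response column is C and the answer key is D, as stated):
import pandas as pd
# Keep only the rows whose word in B also appears in A
overlap = df[df.B.isin(df.A)]
# Total correct: rows where the keyboard response matches the correct response
total_correct = (overlap.C == overlap.D).sum()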
I have a data set that looks like this:
   A    B    C
   1    2  NaN
 NaN    3    4
   1  NaN    5
   1    2    4
   1  NaN    2
   1  NaN  NaN
I want to count the number of times a value appears in some combination of two columns but not in the third. The numeric values themselves are arbitrary; I only care about counting instances.
In other words:
1.) How many times does any value appear in column A and column B but not column C?
2.) How many times does any value appear in column B and column C but not column A?
3.) How many times does any value appear in column A and column C but not in column B?
My expected answers based on the mock data I have given above:
1.) 1 (row 1)
2.) 1 (row 2)
3.) 2 (rows 3 and 5)
You could use isna to create a boolean DataFrame, then keep only the rows that have exactly one NaN value (this excludes both the fully populated fourth row and the last row, which has two NaNs), and finally sum vertically:
df_na = df.isna()
df_na[df_na.sum(axis=1).eq(1)].sum()
Output:
A 1
B 2
C 1
dtype: int64
Reading the output: column "A" is missing while the other two have values once, "B" is missing while the other two have values twice, and so on.
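Here is a minimal end-to-end sketch of the approach, assuming the blanks in the mock data are NaN values:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, np.nan, 1, 1, 1, 1],
                   'B': [2, 3, np.nan, 2, np.nan, np.nan],
                   'C': [np.nan, 4, 5, 4, 2, np.nan]})
df_na = df.isna()                            # True wherever a cell is missing
print(df_na[df_na.sum(axis=1).eq(1)].sum())  # A: 1, B: 2, C: 1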
I want to count the unique values of column B for each unique value in column A, but only for rows where the corresponding value in column C is > 0.
df:
   A   B  C
   1  10  0
   1  12  3
   2   3  1
I tried this, but it's missing the "where" clause to filter for C > 0. How do I add it?
df.groupby(['A'])['B'].apply(lambda b : b.astype(int).nunique())
Let's start by creating the dataframe that the OP mentions in the question:
import pandas as pd
df = pd.DataFrame({'A': [1,1,2], 'B': [10,12,3], 'C': [0,3,1]})
Now, there are various ways to achieve what the OP wants. One of them is to select the part of df where column C is greater than 0, use pandas.DataFrame.groupby to group by column A, and finally use nunique to count the unique values of column B. In one line it looks like the following:
count = df[df['C'] > 0].groupby('A')['B'].nunique()
[Out]:
A
1    1
2    1
Name: B, dtype: int64
If one wants to sum the counts of unique items that satisfy the condition, from the series count one can simply do
count = count.sum()
[Out]:
2
Assuming one wants to do everything in one line, one can chain pandas.Series.sum:
count = df[df['C'] > 0].groupby('A')['B'].nunique().sum()
[Out]:
2
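As a stylistic alternative (not part of the original answer), the row selection can also be written with DataFrame.query, which some find more readable:
count = df.query('C > 0').groupby('A')['B'].nunique().sum()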
I have a dataframe with millions of rows, unique indexes, and a column ('b') that has several repeated values.
I would like to generate a dataframe without the duplicated data, but without losing the index information. I want the new dataframe to have an index that is a concatenation of the old indexes ("old_index1,old_index2") wherever 'b' had duplicated values, and that stays unchanged for rows where 'b' had a unique value. The values of the 'b' column should remain unchanged, as in a keep='first' strategy. Example below.
Input dataframe:
df = pd.DataFrame(data=[[1, "non_duplicated_1"],
                        [2, "duplicated"],
                        [2, "duplicated"],
                        [3, "non_duplicated_2"],
                        [4, "non_duplicated_3"]],
                  index=['one', 'two', 'three', 'four', 'five'],
                  columns=['a', 'b'])
Desired output:
a b
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
The actual dataframe is quite large, so I would like to avoid non-vectorized operations.
I am finding this surprisingly difficult... any ideas?
You can use transform on the index column (after you use reset_index). Then, drop duplicates in column b:
df.index = df.reset_index().groupby('b')['index'].transform(','.join)
df.drop_duplicates('b', inplace=True)
>>> df
a b
index
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
Setup
dct = {'index': ','.join, 'a': 'first'}
You can reset_index before the groupby, then aggregate with the dictionary above:
df.reset_index().groupby('b', as_index=False, sort=False).agg(dct).set_index('index')
b a
index
one non_duplicated_1 1
two,three duplicated 2
four non_duplicated_2 3
five non_duplicated_3 4
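If matching the desired column order matters, one can reorder the columns at the end; a small sketch building on the setup above:
out = (df.reset_index()
         .groupby('b', as_index=False, sort=False)
         .agg(dct)
         .set_index('index')[['a', 'b']])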
I'm working on a small project using Python Pandas and I'm stuck at the following problem:
I have a table where column A contains multiple, possibly non-unique, values and a second column B with values which may be zero. Now I want to group all rows in the DataFrame by their value in column A and then "keep" or "select" only the groups which contain one or more zeros in the B column.
For example from a DataFrame that looks like this:
Column A Column B
-------- --------
b 12
c 56
f 0
b 456
b 334
f 10
I am only interested in the rows (the group) where column A = f:
Column A Column B
-------- --------
f 0
f 10
I know how I could achieve this using loops and iterating over the groups, but I'm looking for simple and reasonably fast code, as the DataFrames I work with can be very large.
My current approach is something like this:
df.groupby("A").filter(lambda x: 0 in x["B"].values)
Obviously I'm new to Python Pandas and am hoping for your help! Thank you in advance!
One way is to get all the values of column A where column B is zero, and then select only the rows whose column A value appears in that set:
groups = df[df['Column B'] == 0]['Column A'].unique()
>>> df[df['Column A'].isin(groups)]
Column A Column B
2 f 0
5 f 10
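For comparison, the groupby/filter approach from the question also works once the actual column names are used, though it is typically slower than the isin approach on large frames; a quick sketch:
df.groupby('Column A').filter(lambda g: g['Column B'].eq(0).any())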
I have two columns in my pandas DataFrame.
A B
0 1 5
1 2 3
2 3 2
3 4 0
4 5 1
I need the value in A where the value of B is minimum. In the above case my answer would be 4 since the minimum B value is 0.
Can anyone help me with it?
To find the minimum in column B, you can use df.B.min(). For your DataFrame this returns 0.
To find values at particular locations in a DataFrame, you can use loc:
>>> df.loc[(df.B == df.B.min()), 'A']
3 4
Name: A, dtype: int64
So here, loc picks out all of the rows where column B is equal to its minimum value (df.B == df.B.min()) and selects the corresponding values in column A.
This method returns all values in A corresponding to the minimum value in B. If you only need to find one of the values, the better way is to use idxmin, as @aus_lacy suggests.
Here's one way:
b_min = df.B.idxmin()
a_val = df.A[b_min]
idxmin() returns the index of the minimum value within column B. You then locate the value at that same index in column A.
Or, if you want a single (albeit less readable) line, you can do it like this:
a_val = df.A[df.B.idxmin()]
Also, as a disclaimer, this solution assumes that the minimum value in column B is unique. For example, if you had a data set that looked like this:
A B
1 2
2 5
3 0
4 3
5 0
My solution would return the first instance of B's minimum value, which in this case is in the third row, with a corresponding A value of 3. If the minimum value of B might not be unique, you should go with @ajcr's solution.
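To illustrate the difference, a quick sketch using the non-unique data above:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 5, 0, 3, 0]})
df.A[df.B.idxmin()]              # 3 -- only the first minimum is found
df.loc[df.B == df.B.min(), 'A']  # returns both 3 and 5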