Python Pandas select group where a specific column contains zeroes

I'm working on a small project using Python Pandas and I'm stuck on the following problem:
I have a table where column A contains multiple, possibly non-unique values, and a second column B with values that may be zero. Now I want to group all rows in the DataFrame by their value in column A and then keep only the groups that contain one or more zeros in the B column.
For example, from a DataFrame that looks like this:
Column A Column B
-------- --------
b 12
c 56
f 0
b 456
b 334
f 10
I am only interested in all rows (the group) where column A = f:
Column A Column B
-------- --------
f 0
f 10
I know how I could achieve this using loops and iterating over the groups, but I'm looking for simple and reasonably fast code, as the DataFrames I work with can get very large.
My current approach is something like this:
df.groupby("A").filter(lambda x: 0 in x["B"].values)
Obviously I'm new to Python Pandas and am hoping for your help!
Thank you in advance!

One way is to get all values of column A where column B is zero, and then select every row whose column A value is in that set.
>>> groups = df[df['Column B'] == 0]['Column A'].unique()
>>> df[df['Column A'].isin(groups)]
Column A Column B
2 f 0
5 f 10
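An alternative that avoids calling a Python function once per group, which can matter on large frames, is to build a boolean mask with groupby/transform and index with it directly. A minimal sketch, assuming the same 'Column A'/'Column B' names as above:

import pandas as pd

df = pd.DataFrame({'Column A': ['b', 'c', 'f', 'b', 'b', 'f'],
                   'Column B': [12, 56, 0, 456, 334, 10]})

# For each row, check whether any 'Column B' value within that row's
# 'Column A' group is zero, then keep only the rows where that is true.
mask = df['Column B'].eq(0).groupby(df['Column A']).transform('any')
print(df[mask])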

Related

Pandas find the first occurrence of a specific value in a row within multiple columns and return column index

For a dataframe:
df = pd.DataFrame({"A":[0,0],"B":[0,1],"C":[1,2],"D":[2,2]})
How can I obtain the column name (or column index) when the value is 2, or some other given value,
and put it in a new column of df, say df["TAG"]?
df = pd.DataFrame({"A":[0,0],"B":[0,1],"C":[1,2],"D":[2,2],"TAG":["D","C"]})
I tried
df["TAG"] = np.where(df[cols] >= 2, df.columns, '')
where cols is the list of df columns.
So far I have only found how to look up the row index when matching a value in Pandas.
In Excel, a similar lookup can be done with MATCH(TRUE,INDEX($A:$D>=2,0),) applied to multiple rows.
Any help or hints are appreciated.
Thank you so much in advance
Try idxmax:
>>> df['TAG'] = df.ge(2).T.idxmax()
>>> df
A B C D TAG
0 0 0 1 2 D
1 0 1 2 2 C
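The transpose is not strictly needed; idxmax can scan across columns with axis=1. A minimal sketch of this variant, with one caveat: if a row contains no value >= 2, idxmax still returns the first column label, so rows without a match would be tagged 'A':

import pandas as pd

df = pd.DataFrame({"A": [0, 0], "B": [0, 1], "C": [1, 2], "D": [2, 2]})

# ge(2) builds a boolean mask; idxmax(axis=1) returns the label of the
# first True in each row.
df['TAG'] = df.ge(2).idxmax(axis=1)
print(df)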

Aggregating and counting in pandas

For the following df:
group participated
A 1
A 1
B 0
A 0
B 1
A 1
B 0
B 0
I want to count the total number of values in the participated column for each value in the group column (a groupby-count), and then also count how many 1s there are in each group.
Something like
group tot_participated 1s
A 4 3
B 4 1
I know the first part is simple and can be done with a simple
grouped_df = df.groupby('group').count().reset_index()
but I'm unable to wrap my head around the second part. Any help will be greatly appreciated!
You could follow the groupby with an aggregation as below:
grp_df = df.groupby('group', as_index=False).agg({'participated':['count','sum']})
grp_df.columns = ['group','tot_participated','1s']
grp_df.head()
The caveat to using .agg with multiple aggregation functions on the same column is that a hierarchical column index (a MultiIndex) is created. This can be remedied by overwriting the column names, as in the second line above.
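Alternatively, named aggregation (available since pandas 0.25) produces flat column names directly and sidesteps the MultiIndex. A sketch on the question's data; because '1s' is not a valid Python identifier, it is passed via dict unpacking:

import pandas as pd

df = pd.DataFrame({'group': list('AABABABB'),
                   'participated': [1, 1, 0, 0, 1, 1, 0, 0]})

# Each keyword maps an output column name to (input column, function).
grp_df = df.groupby('group', as_index=False).agg(
    tot_participated=('participated', 'count'),
    **{'1s': ('participated', 'sum')},
)
print(grp_df)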

Comparing Overlap in Pandas Columns

So I have four columns in a pandas dataframe: A, B, C, and D. Column A contains 30 words, 18 of which are in column B. Column C contains either a 1 or a 2 (the keyboard response to the column B words), and column D also contains a 1 or a 2 (the correct response).
What I need to do is count the total correct for only the words where columns A and B overlap. I understand how to compare the C and D columns to get the total correct once I have the right dataframe, but I am having a hard time wrapping my head around comparing the overlap in A and B.
Use Series.isin():
df.B.isin(df.A)
That will give you a boolean Series the same length as df.B indicating for each value in df.B whether it is also present in df.A.
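Putting it together with the C/D comparison from the question, a minimal sketch (assuming the columns are literally named A, B, C, and D; the example words are made up):

import pandas as pd

df = pd.DataFrame({'A': ['cat', 'dog', 'bird'],
                   'B': ['dog', 'bird', 'fish'],
                   'C': [1, 2, 1],
                   'D': [1, 1, 1]})

overlap = df.B.isin(df.A)  # True where the word in B also appears in A
# Count rows where the response matches the correct answer, restricted
# to the overlapping words.
total_correct = (df.loc[overlap, 'C'] == df.loc[overlap, 'D']).sum()
print(total_correct)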

Lambda function with groupby and condition issue

I want to count the unique values of column B for each unique value in column A, but only where the corresponding column C is greater than 0.
df:
A B C
1 10 0
1 12 3
2 3 1
I tried this, but it's missing the where clause to filter for C > 0. How do I add it?
df.groupby(['A'])['B'].apply(lambda b : b.astype(int).nunique())
Let's start by creating the dataframe that the OP mentions in the question:
import pandas as pd
df = pd.DataFrame({'A': [1,1,2], 'B': [10,12,3], 'C': [0,3,1]})
Now, in order to achieve what the OP wants, there are various options. One way is to select the rows of df where column C is greater than 0, then use pandas.DataFrame.groupby to group by column A, and finally use nunique to count the unique values of column B. In one line it looks like the following:
count = df[df['C'] > 0].groupby('A')['B'].nunique()
[Out]:
A
1 1
2 1
If one wants to sum the per-group counts of unique items that satisfy the condition, from the series count, one can simply do
count = count.sum()
[Out]:
2
Assuming one wants to do everything in one line, one can append pandas.Series.sum as
count = df[df['C'] > 0].groupby('A')['B'].nunique().sum()
[Out]:
2

New column with incremental numbers based on a different column's value (pandas)

I want to add a column with incremental numbers for rows that share the same value in a given column;
e.g. if I would have this df
df=pd.DataFrame([['a','b'],['a','c'],['c','b']])
and I want incremental numbers for the first column. It should look like this:
df=pd.DataFrame([['a','b',1],['a','c',2],['c','b',1]])
I found SQL solutions, but I'm working with IPython/pandas. Can someone help me?
Use cumcount; for the name of the new column, use the length of the original columns:
print (len(df.columns))
2
df[len(df.columns)] = df.groupby(0).cumcount() + 1
print (df)
0 1 2
0 a b 1
1 a c 2
2 c b 1
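If the columns have string names instead of 0 and 1, the same pattern works with a descriptive name for the new column. A sketch with hypothetical names 'first', 'second', and 'counter':

import pandas as pd

df2 = pd.DataFrame({'first': ['a', 'a', 'c'], 'second': ['b', 'c', 'b']})

# cumcount numbers the rows within each group starting at 0, so add 1.
df2['counter'] = df2.groupby('first').cumcount() + 1
print(df2)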
