Iterate over results of value_counts() on a groupby object - python

I have a dataframe like df = pd.DataFrame({'ID':[1,1,2,2,3,3,4,4,5,5,5],'Col1':['Y','Y','Y','N','N','N','Y','Y','Y','N','N']}). What I would like to do is group by the 'ID' column and then get statistics on three conditions:
1. How many groups have only 'Y's
2. How many groups have at least 1 'Y' and at least 1 'N'
3. How many groups have only 'N's
groups = df.groupby('ID')
groups.Col1.value_counts()
gives me a visual representation of what I'm looking for, but how can I then iterate over the results of the value_counts() method to check for these conditions?

I think pd.crosstab() may be more suitable for your use case.
Code
df_crosstab = pd.crosstab(df["ID"], df["Col1"])
Col1  N  Y
ID
1     0  2
2     1  1
3     2  0
4     0  2
5     2  1
Groupby can also do the job, but much more tedious:
df_crosstab = df.groupby('ID')["Col1"]\
    .value_counts()\
    .rename("count")\
    .reset_index()\
    .pivot(index="ID", columns="Col1", values="count")\
    .fillna(0)
Filtering the groups
After producing df_crosstab, the filters for your 3 questions could be easily constructed:
# 1. How many groups have only 'Y's
df_crosstab[df_crosstab['N'] == 0]
Col1  N  Y
ID
1     0  2
4     0  2
# 2. How many groups have at least 1 'Y' and at least 1 'N'
df_crosstab[(df_crosstab['N'] > 0) & (df_crosstab['Y'] > 0)]
Col1  N  Y
ID
2     1  1
5     2  1
# 3. How many groups have only 'N's
df_crosstab[df_crosstab['Y'] == 0]
Col1  N  Y
ID
3     2  0
If you want the number of groups only, just take the length of the filtered crosstab dataframe, as sketched below. I believe this also makes automation much easier.
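A minimal sketch, counting the groups for each condition with len() on the filtered crosstab:
only_y = len(df_crosstab[df_crosstab['N'] == 0])                            # 2
mixed  = len(df_crosstab[(df_crosstab['N'] > 0) & (df_crosstab['Y'] > 0)])  # 2
only_n = len(df_crosstab[df_crosstab['Y'] == 0])                            # 1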

groups = df.groupby('ID')
answers = groups.Col1.value_counts()
for item in answers.items():
    print(item)
What you are making is a Series from value_counts(), and you can iterate over it (Series.items() replaces iteritems(), which was removed in pandas 2.0). Note that this alone is not yet what you want: you would still have to check each of these items against the conditions you are looking for.
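If you do want to derive the three counts from that iteration, here is a minimal sketch: it collects which values each group contains, then tallies the resulting sets.
from collections import defaultdict

seen = defaultdict(set)
for (group_id, value), _count in answers.items():
    seen[group_id].add(value)

only_y = sum(vals == {'Y'} for vals in seen.values())       # 2
only_n = sum(vals == {'N'} for vals in seen.values())       # 1
mixed  = sum(vals == {'Y', 'N'} for vals in seen.values())  # 2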

If you group by 'ID' and use the 'sum' function, you will have all letters in one string for each group. Then you can just count letters within those strings to check your conditions, and take the sums of those checks to get the exact numbers for all of the groups:
strings = df.groupby(['ID']).sum()
only_y = sum(strings['Col1'].str.count('N') == 0)
only_n = sum(strings['Col1'].str.count('Y') == 0)
both = sum((strings['Col1'].str.count('Y') > 0) & (strings['Col1'].str.count('N') > 0))
print('Number of groups with Y only: ' + str(only_y),
      'Number of groups with N only: ' + str(only_n),
      'Number of groups with at least one Y and one N: ' + str(both),
      sep='\n')
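For reference, groupby().sum() concatenates the strings within each group, so for the example frame strings looks roughly like this:
print(strings)
#     Col1
# ID
# 1     YY
# 2     YN
# 3     NN
# 4     YY
# 5    YNN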

Related

groupby name and position in the group

I would like to group by a column and then split one or more of the groups into two.
Example
This
np.random.seed(11)
df = pd.DataFrame({"animal":np.random.choice( ['panda','python','shark'], 10),
"number": 1})
df.sort_values("animal")
gives me this dataframe
   animal  number
1   panda       1
4   panda       1
7   panda       1
9   panda       1
0  python       1
2  python       1
3  python       1
5  python       1
8  python       1
6   shark       1
Now I would like to group by animal but also split the "pythons" into the first two and the rest of the "pythons". So that
df.groupby(your_magic).sum()
gives me
          number
animal
panda          4
python_1       2
python_2       3
shark          1
What about
np.random.seed(11)
df = pd.DataFrame({"animal":np.random.choice( ['panda','python','shark'], 10),
"number": 1})
## find index on which you split python into python_1 and python_2
python_split_idx = df[df['animal'] == 'python'].iloc[2].name
## rename python according to index
df[df['animal'] == 'python'] = df[df['animal'] == 'python'].apply(
    lambda row: pd.Series(
        ['python_1' if row.name < python_split_idx else 'python_2', row.number],
        index=['animal', 'number']),
    axis=1)
## group according to all animals and sum the number
df.groupby('animal').agg({'number': sum})
Output:
          number
animal
panda          4
python_1       2
python_2       3
shark          1
I ended up using something similar to Stefan's answer but slightly reformulated it to avoid having to use apply. This looks a bit cleaner to me.
idx1 = df[df['animal'] == 'python'].iloc[:2].index
idx2 = df[df['animal'] == 'python'].iloc[2:].index
df.loc[idx1, "animal"] = "python_1"
df.loc[idx2, "animal"] = "python_2"
df.groupby("animal").sum()
This is a quick and dirty (and somewhat inefficient) way to do it if you want to rename all your pythons before you group them.
indices = []
for i, v in enumerate(df['animal']):
    if v == 'python':
        if len(indices) < 2:
            indices.append(i)
            df.loc[i, 'animal'] = 'python_1'
        else:
            df.loc[i, 'animal'] = 'python_2'
grouped = df.groupby('animal').agg('sum')
print(grouped)
This provides your desired output exactly.
As an alternative, here's a totally different approach that creates another column to capture whether each animal is a member of the group of the top two pythons and then groups on both columns.
snakes = df[df['animal'] == 'python']
df['special_snakes'] = [1 if i not in snakes.index[:2] else 0 for i in df.index]
df.groupby(['animal', 'special_snakes']).agg('sum')
The output looks a bit different, but achieves the same outcome. This approach also has the advantage of capturing the condition on which you are grouping your animals without actually changing the values in the animal column.
                       number
animal special_snakes
panda  1                    4
python 0                    2
       1                    3
shark  1                    1
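If you prefer the python_1/python_2 labels from the question, one hypothetical follow-up (not part of the original answer) is to collapse the two-level index afterwards:
out = df.groupby(['animal', 'special_snakes']).agg('sum')
# rename ('python', 0) -> 'python_1' and ('python', 1) -> 'python_2';
# all other animals keep their plain name
out.index = [f"{animal}_{flag + 1}" if animal == 'python' else animal
             for animal, flag in out.index]
print(out)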

Selecting from pandas groups without replacement when possible

Say that I have a Dataframe that looks like:
Name Group_Id
A 1
B 2
C 2
I want a piece of code that selects n sets such that, as much as possible, each set contains different members of the same group.
A representative from each group must appear in each set (the representatives should be picked at random).
Only if a group's size is smaller than n should the same representative appear in multiple sets.
n is smaller than or equal to the size of the biggest group.
So for example, for the above Dataframe and n=2 this would be a valid result:
set 1
Name Group_Id
A 1
B 2
set 2
Name Group_Id
A 1
C 2
however this one is not
set 1
Name Group_Id
A 1
B 2
set 2
Name Group_Id
A 1
B 2
One way could be to sample with replacement from each group that is smaller than the largest group, so that every group ends up with the same number of rows and each resulting dataframe gets a sample from each group. Then interleave the groups' rows and build a list of dataframes, as shown:
# size of largest group
max_size = df.groupby('Group_Id').size().max()
# upsample group if necessary
l = [g.sample(max_size, replace=True) if g.shape[0] < max_size else g
     for _, g in df.groupby('Group_Id')]
# interleave rows and build list of dataframes
[pd.DataFrame(g, columns=df.columns) for g in zip(*(i.to_numpy().tolist() for i in l))]
[ Name Group_Id
0 A 1
1 B 2,
Name Group_Id
0 A 1
1 C 2]
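One caveat: groups that already have max_size rows are kept in their original order, so their representatives are not picked at random as the question asks. A small, hedged tweak is to shuffle those groups as well:
l = [g.sample(max_size, replace=True) if len(g) < max_size else g.sample(frac=1)
     for _, g in df.groupby('Group_Id')]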
Here's an idea:
# 1. label a random order within each Group_Id
df['sets'] = df.sample(frac=1).groupby('Group_Id').cumcount()
# 2. pivot the table and forward-fill
sets = (df.pivot(index='sets', columns='Group_Id').ffill()  # for groups with fewer than N elements, repeat the last element
        .stack('Group_Id').reset_index('Group_Id')          # return Group_Id as a normal column
       )
# 3. slices:
N = 2
for i in range(N):
    print(sets.loc[i])
Output:
      Group_Id Name
sets
0            1    A
0            2    C
      Group_Id Name
sets
1            1    A
1            2    B

How to find the number of an element in a column of a dataframe

For example, I have a dataframe A like the one below:
   a  b  c
x  0  2  1
y  1  3  2
z  0  2  4
I want to get the number of 0s in column 'a', which should return 2 (the values at rows x and z of column 'a').
Is there a simple way, or a function, with which I can easily do this?
I've Googled for it, but there are only articles like this one:
count the frequency that a value occurs in a dataframe column
which builds a whole new dataframe and is more complicated than what I need.
Use sum with a boolean mask - True values are treated as 1, so the output is the count of 0 values:
out = A.a.eq(0).sum()
print (out)
2
Try value_counts from pandas:
df.a.value_counts()[0]
If the value you are looking for can vary, use df[column_name].value_counts()[searched_value].
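If the searched value might be absent from the column entirely, a defensive variant (an assumption about your needs, not part of the original answer) is Series.get, which returns a default instead of raising:
count = A['a'].value_counts().get(0, 0)  # 0 instead of a KeyError when absent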

How to get pandas dataframe series name given a column value?

I have a python pandas dataframe with a bunch of names and series, and I create a final column where I sum up the series. I want to get just the row name where the sum of the series equals 0, so I can then later delete those rows. My dataframe is as follows (the last column I create just to sum up the series):
     1  2  3  4  total
Ash  1  0  1  1      3
Bel  0  0  0  0      0
Cay  1  0  0  0      1
Jeg  0  1  1  1      3
Jut  1  1  1  1      4
Based on the last column, the series "Bel" is 0, so I want to be able to print out that name only, and then later I can delete that row or keep a record of these rows.
This is my code so far:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    for values in df['total']:
        if values == 0:
            print(df.index[values])
But this is obviously wrong, because I am passing the value 0 itself as the index, which will always print the name of the first row. What method can I use here instead?
There are great solutions below, and I also found a way using a simpler python skill, enumerate (because I still find list comprehensions hard to write):
def check_empty(df):
    df['total'] = df.sum(axis=1)
    for name, values in enumerate(df['total']):
        if values == 0:
            print(df.index[name])
One possible way may be the following, where df is filtered on the value in total:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    index = df[df['total'] == 0].index.values.tolist()
    print(index)
If you would like to iterate through the rows, then df.iterrows() may be another way as well:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    for index, row in df.iterrows():
        if row['total'] == 0:
            print(index)
Another option is np.where.
import numpy as np
df.iloc[np.where(df.loc[:, 'total'] == 0)]
Output:
     1  2  3  4  total
Bel  0  0  0  0      0
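Since the stated goal is to delete those rows later, a minimal sketch for that final step:
df = df[df['total'] != 0]                 # keep only rows with a non-zero total
# or, equivalently, drop by the collected index:
df = df.drop(df[df['total'] == 0].index)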

Selecting multiple (neighboring) rows conditionally

I'd like to return the rows which satisfy a certain condition. I can do this for a single row, but I need it for multiple rows combined. For example, 'light green' qualifies because 'XYZ' are positive and 'total' > 10, whereas 'Red' does not; but when I combine it with a neighbouring row or rows, it does => 'dark green'. Can I achieve this going over all the rows without returning duplicate rows?
N = 1000
np.random.seed(0)
df = pd.DataFrame(
    {'X': np.random.uniform(-3, 10, N),
     'Y': np.random.uniform(-3, 10, N),
     'Z': np.random.uniform(-3, 10, N),
    })
df['total'] = df.X + df.Y + df.Z
df.head(10)
EDIT: the desired output is rows where 'XYZ' > 0 and 'total' > 10.
Here's a try. You would maybe want to use rolling or expanding (for speed and elegance) instead of explicitly looping with range, but I did it that way so as to be able to print out the rows being used to calculate each boolean.
df = df[['X', 'Y', 'Z']]  # remove the "total" column in order
                          # to make the syntax a little cleaner
df = df.head(4)           # keep the example more manageable
for i in range(len(df)):
    for k in range(i + 1, len(df) + 1):
        df_sum = df[i:k].sum()
        print("rows", i, "to", k, (df_sum > 0).all() & (df_sum.sum() > 10))
rows 0 to 1 True
rows 0 to 2 True
rows 0 to 3 True
rows 0 to 4 True
rows 1 to 2 False
rows 1 to 3 True
rows 1 to 4 True
rows 2 to 3 True
rows 2 to 4 True
rows 3 to 4 True
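For reference, a minimal sketch of the rolling variant mentioned at the start of this answer; it checks every window of a fixed length k (here k = 2) rather than every (i, k) pair:
k = 2
roll = df.rolling(k).sum()                               # sums over each k-row window
mask = (roll > 0).all(axis=1) & (roll.sum(axis=1) > 10)
# mask is True at row i when rows i-k+1 .. i qualify together;
# the first k-1 entries are False because their windows are incomplete
print(mask)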
I am not too sure if I understood your question correctly, but if you are looking to apply multiple conditions within a dataframe, you can consider this approach:
new_df = df[(df["X"] > 0) & (df["Y"] < 0)]
The & condition is for AND, while replacing that with | is for OR condition. Do remember to put the different conditions in ().
Lastly, if you want to remove duplicates, you can use this
new_df.drop_duplicates()
You can find more information about this function here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Hope my answer is useful to you.
