How to get pandas dataframe series name given a column value? - python

I have a python pandas dataframe with a bunch of names and series, and I create a final column where I sum up the series. I want to get just the row name where the sum of the series equals 0, so I can then later delete those rows. My dataframe is as follows (the last column I create just to sum up the series):
     1  2  3  4  total
Ash  1  0  1  1      3
Bel  0  0  0  0      0
Cay  1  0  0  0      1
Jeg  0  1  1  1      3
Jut  1  1  1  1      4
Based on the last column, the series "Bel" is 0, so I want to be able to print out that name only, and then later I can delete that row or keep a record of these rows.
This is my code so far:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    for values in df['total']:
        if values == 0:
            print(df.index[values])
But this is obviously wrong: whenever the total is 0, I index df.index with that value (0), so it always prints the name of the first row. I'm not sure what method I can use here instead.
There are great solutions below and I also found a way using a simpler python skill, enumerate (because I still find list comprehension hard to write):
def check_empty(df):
    df['total'] = df.sum(axis=1)
    for name, values in enumerate(df['total']):
        if values == 0:
            print(df.index[name])

One possible way is to filter df on the value in total and take the index of the matching rows:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    index = df[df['total'] == 0].index.values.tolist()
    print(index)
If you would rather iterate through the rows, df.iterrows() is another option:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    for index, row in df.iterrows():
        if row['total'] == 0:
            print(index)

Another option is np.where.
import numpy as np
df.iloc[np.where(df.loc[:, 'total'] == 0)]
Output:
     1  2  3  4  total
Bel  0  0  0  0      0
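Since the goal is to record the names and later delete those rows, both steps can also be done without a loop. The following is a minimal sketch that rebuilds the example frame from the question; the names is_empty, zero_names and cleaned are only illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question (without the helper 'total' column).
df = pd.DataFrame(
    {1: [1, 0, 1, 0, 1], 2: [0, 0, 0, 1, 1], 3: [1, 0, 0, 1, 1], 4: [1, 0, 0, 1, 1]},
    index=['Ash', 'Bel', 'Cay', 'Jeg', 'Jut'],
)

# Boolean mask: True for rows whose values sum to zero.
is_empty = df.sum(axis=1) == 0

# Row names (index labels) of the all-zero rows, kept as a record.
zero_names = df.index[is_empty].tolist()
print(zero_names)

# Drop those rows; drop returns a new frame and leaves df untouched.
cleaned = df.drop(index=zero_names)
print(cleaned)
```

Keeping the mask separate from the drop makes it easy to log the removed names before discarding the rows.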

Related

Pandas dataframe row counts to plt incrementally after every 4 consecutive rows

I am trying to assign an ID to a pandas dataframe based on row count. To do this, I am trying to apply the logic below to the dataframe:
import math

num = df.shape[0]
for i in range(num):
    print(math.ceil(i/4))
So the idea is that for every 4 consecutive rows, an ID would be assigned. So the resultant dataframe would look like
col_1 Group_ID
v_1 1
v_2 1
v_3 1
v_4 1
v_5 2
v_6 2
v_7 2
v_8 2
v_9 3
v_10 3
--- And so on.
Just a quick thought: how can I use the apply function on df.index?
Can I use the code below?
df['Index'] = df.index
df['GroupID'] = df['Index'].apply(np.ceil)
Any hints?
You can pass a function to apply, so create a named function and pass it:
def everyFour(rowIdx):
    return math.ceil(rowIdx / 4)

df['GroupId'] = df['Index'].apply(everyFour)
or just use a lambda:
df['GroupId'] = df['Index'].apply(lambda rowIdx: math.ceil(rowIdx / 4))
Note that this leaves the first row (index 0) with ID 0, so you might want to add 1 to rowIdx before dividing by 4.
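The same grouping can be done without apply by working on row positions directly. This is a sketch on a hypothetical ten-row frame shaped like the expected output in the question; it uses integer division instead of math.ceil, which avoids the off-by-one issue for the first row.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with ten rows, as in the question's expected output.
df = pd.DataFrame({'col_1': [f'v_{i}' for i in range(1, 11)]})

# Row position 0..n-1, integer-divided by 4, then shifted so IDs start at 1.
df['Group_ID'] = np.arange(len(df)) // 4 + 1
print(df)
```

Using np.arange(len(df)) rather than df.index keeps the result correct even when the index is not a clean 0..n-1 range.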

python / pandas: How to count each cluster of unevenly distributed distinct values in each row

I am transitioning from excel to python and finding the process a little daunting. I have a pandas dataframe and cannot find how to count the total of each cluster of '1's' per row and group by each ID (example data below).
       ID  20-21  19-20  18-19  17-18  16-17  15-16  14-15  13-14  12-13  11-12
0  335344      0      0      1      1      1      0      0      0      0      0
1  358213      1      1      0      1      1      1      1      0      1      0
2  358249      0      0      0      0      0      0      0      0      0      0
3  365663      0      0      0      1      1      1      1      1      0      0
The result of the above in the format
ID
    LastColumn Heading a '1' occurs: count of '1's' in that cluster
would be:
335344
    16-17: 3
358213
    19-20: 2
    14-15: 4
    12-13: 1
365663
    13-14: 5
There are more than 11,000 rows of data, and I would like to output the result to a txt file. I have been unable to find any examples of how the same values are clustered by row, with a count for each cluster, but I am probably not using the correct python terminology. I would be grateful if someone could point me in the right direction. Thanks in advance.
The first step is to reshape with DataFrame.set_index and DataFrame.stack. Then label the consecutive runs in a new column g: compare each value with the previous one (Series.shift) for inequality and take the cumulative sum with Series.cumsum. Finally, keep only the rows equal to 1 and use named aggregation in GroupBy.agg, taking the last column label (GroupBy.last) and the size (GroupBy.size) of each group:
df = df.set_index('ID').stack().reset_index(name='value')
df['g'] = df['value'].ne(df['value'].shift()).cumsum()
df1 = (df[df['value'].eq(1)].groupby(['ID', 'g'])
         .agg(a=('level_1','last'), b=('level_1','size'))
         .reset_index(level=1, drop=True)
         .reset_index())
print (df1)
       ID      a  b
0  335344  16-17  3
1  358213  19-20  2
2  358213  14-15  4
3  358213  12-13  1
4  365663  13-14  5
Finally, to write to a txt file, use DataFrame.to_csv:
df1.to_csv('file.txt', index=False)
If you need your custom format in the text file, use:
with open("file.txt","w") as f:
    for i, g in df1.groupby('ID'):
        f.write(f"{i}\n")
        for a, b in g[['a','b']].to_numpy():
            f.write(f"\t{a}: {b}\n")
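To see how the run labelling with shift and cumsum in the answer above works, here is a small sketch on a single hypothetical row of 0/1 flags (the series s and the names starts, g and run_lengths are made up for illustration).

```python
import pandas as pd

# A single hypothetical row of 0/1 flags.
s = pd.Series([0, 0, 1, 1, 1, 0, 1])

# True wherever the value differs from the previous one, i.e. a new run starts.
starts = s.ne(s.shift())

# The cumulative sum turns the run starts into one label per element.
g = starts.cumsum()
print(g.tolist())

# Keep only the 1s and count how many fall in each labelled run.
ones = s.eq(1)
run_lengths = s[ones].groupby(g[ones]).size()
print(run_lengths.tolist())
```

Each change of value bumps the label by one, so every run of identical values ends up with its own group key.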
You just need to use the sum method and specify which axis to sum along. To get the sum of each row, create a new series equal to the row sums.
# create a new series holding the sum of the values in each row
df['sum'] = df.sum(axis=1)  # axis=1 sums across the columns, giving one total per row
The best method for getting the sum of each column is dependent on how you want to use that information but in general the core is just to use the sum method on the series and assign it to a variable.
# sum a column and assign result to variable
foo = df['20-21'].sum() # default axis=0
bar = df['16-17'].sum() # default axis=0
print(foo) # returns 1
print(bar) # returns 3
You can get the sum of each column using a for loop and add them to a dictionary. Here is a quick function I put together that gets the sum of each column and returns a dictionary of the results, so you know which total belongs to which column. The two inputs are 1) the dataframe and 2) a list of any column names you would like to ignore.
import pandas as pd

def get_df_col_sum(frame: pd.DataFrame, ignore: list) -> dict:
    """Get the sum of each column in a dataframe in a dictionary"""
    # get list of headers in dataframe
    dfcols = frame.columns.tolist()
    # create a blank dictionary to store results
    dfsums = {}
    # loop through each column and add its sum to the dictionary
    for dfcol in dfcols:
        if dfcol not in ignore:
            dfsums.update({dfcol: frame[dfcol].sum()})
    return dfsums
I then ran the following code
# read excel to dataframe
df = pd.read_excel(test_file)
# ignore the ID column
ignore_list = ['ID']
# get sum for each column
res_dict = get_df_col_sum(df, ignore_list)
print(res_dict)
and got the following result.
{'20-21': 1, '19-20': 1, '18-19': 1, '17-18': 3, '16-17': 3, '15-16':
2, '14-15': 2, '13-14': 1, '12-13': 1, '11-12': 0}
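For comparison, the same dictionary can probably be produced without an explicit loop. This is a sketch that rebuilds the sample rows from the question rather than reading an Excel file; col_sums is an illustrative name.

```python
import pandas as pd

# The sample rows from the question.
df = pd.DataFrame(
    [[335344, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
     [358213, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0],
     [358249, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     [365663, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]],
    columns=['ID', '20-21', '19-20', '18-19', '17-18', '16-17',
             '15-16', '14-15', '13-14', '12-13', '11-12'],
)

# Drop the ID column, sum every remaining column, and convert to a dict.
col_sums = df.drop(columns=['ID']).sum().to_dict()
print(col_sums)
```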
Sources: Sum by row, Pandas Sum, Add pairs to dictionary

Append output from code to row it came from in python dataframe

I've got some code that works and creates the output, but I'd like to tack the output back onto the row it came from. I'm struggling to do it with a join because there isn't a unique identifier, so ideally I would like to clean this step up.
The first bit of code creates recent_transactions, and then I'd like to append the output from class_df onto the end of each row:
recent_transactions = pd.DataFrame(data)
recent_transactions
class_df = pd.DataFrame(json.loads(i['unknownStatementItem']) for i in recent_transactions.itemClassData)
You can create a unique identifier by using the index entry of a dataframe that has no duplicate index numbers. You can reset the index to guarantee a clean index with no duplicates. Ideally you can use vectorization rather than iteration, but at worst, get the unique index number from the dataframe and use it to find the correct row for setting a new column to the function's output.
row1list = [1, 2]
df = pd.DataFrame([row1list],
                  columns=['a', 'b'])
# duplicate index numbers, so clean that up next
# (pd.concat is used here because DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, df])
print(df)
# a b
# 0 1 2
# 0 1 2
df = df.reset_index(drop=True).reset_index()
# drop old index with duplicates, make new clean 'index' and make it available as a column
print(df)
# index a b
# 0 0 1 2
# 1 1 1 2
df['result_of_some_function'] = -1 # start with a bogus value, upgrade as appropriate
for i in range(len(df)):
    result_of_some_function = i * 2
    df.loc[df['index'] == i, 'result_of_some_function'] = result_of_some_function
print(df)
# index a b result_of_some_function
# 0 0 1 2 0
# 1 1 1 2 2
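If the parsed frame is built in the same row order as the source frame, the index itself can serve as the identifier. The sketch below assumes, as the generator in the question suggests, that each itemClassData entry is a dict whose 'unknownStatementItem' value is a JSON string; the sample data, the parsed list and the combined frame are hypothetical.

```python
import json
import pandas as pd

# Hypothetical stand-in for the question's data: each itemClassData entry is a
# dict whose 'unknownStatementItem' value is a JSON string.
data = [
    {'amount': 10.0,
     'itemClassData': {'unknownStatementItem': '{"category": "food", "score": 0.9}'}},
    {'amount': 25.5,
     'itemClassData': {'unknownStatementItem': '{"category": "rent", "score": 0.7}'}},
]
recent_transactions = pd.DataFrame(data)

# Parse each row's JSON and reuse the source frame's index for the new frame.
parsed = [json.loads(item['unknownStatementItem'])
          for item in recent_transactions['itemClassData']]
class_df = pd.DataFrame(parsed, index=recent_transactions.index)

# join aligns on the index, so each parsed row lands on the row it came from.
combined = recent_transactions.join(class_df)
print(combined)
```

This relies on the index having no duplicates; if it does, resetting it first with reset_index(drop=True) keeps the join one-to-one.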

What is the pythonic way to do a conditional count across pandas dataframe rows with apply?

I'm trying to do a conditional count across records in a pandas dataframe. I'm new at Python and have a working solution using a for loop, but running this on a large dataframe with ~200k rows takes a long time and I believe there is a better way to do this by defining a function and using apply, but I'm having trouble figuring it out.
Here's a simple example.
Create a pandas dataframe with two columns:
import pandas as pd
data = {'color': ['blue','green','yellow','blue','green','yellow','orange','purple','red','red'],
        'weight': [4,5,6,4,1,3,9,8,4,1]
       }
df = pd.DataFrame(data)
# for each row, count the number of other rows with the same color and a lesser weight
counts = []
for i in df.index:
    c = df.loc[i, 'color']
    w = df.loc[i, 'weight']
    ct = len(df.loc[(df['color']==c) & (df['weight']<w)])
    counts.append(ct)
df['counts, same color & less weight'] = counts
For each record, the 'counts, same color & less weight' column is intended to get a count of the other records in the df with the same color and a lesser weight. For example, the result for row 0 (blue, 4) is zero because no other records with color=='blue' have lesser weight. The result for row 1 (green, 5) is 1 because row 4 is also color=='green' but weight==1.
How do I define a function that can be applied to the dataframe to achieve the same?
I'm familiar with apply, for example to square the weight column I'd use:
df['weight squared'] = df['weight'].apply(lambda x: x**2)
... but I'm unclear how to use apply to do a conditional calculation that refers to the entire df.
Thanks in advance for any help.
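We can use groupby with transform('min') and compare each weight with the minimum for its color. Note that this gives a 0/1 flag (whether any lighter row of the same color exists); it matches the requested count here only because no color appears more than twice in the sample: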
df.weight.gt(df.groupby('color').weight.transform('min')).astype(int)
0 0
1 1
2 1
3 0
4 0
5 0
6 0
7 0
8 1
9 0
Name: weight, dtype: int64
#df['c...]=df.weight.gt(df.groupby('color').weight.transform('min')).astype(int)
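For an actual count rather than a 0/1 flag, one option is a grouped rank. This is a sketch of a different technique from the answer above: within each color, rank(method='min') equals one plus the number of strictly smaller weights, so subtracting 1 gives the count. The column name counts is only illustrative.

```python
import pandas as pd

data = {'color': ['blue', 'green', 'yellow', 'blue', 'green', 'yellow',
                  'orange', 'purple', 'red', 'red'],
        'weight': [4, 5, 6, 4, 1, 3, 9, 8, 4, 1]}
df = pd.DataFrame(data)

# Within each color, rank(method='min') is 1 + the number of strictly smaller
# weights, so subtracting 1 gives the count of lighter rows of the same color.
df['counts'] = (
    df.groupby('color')['weight'].rank(method='min').sub(1).astype(int)
)
print(df)
```

Ties share the lowest rank under method='min', so rows with equal weight do not count each other, matching the strict "lesser weight" condition in the loop.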

In a DataFrame, how could we get a list of indexes with 0's in specific columns?

We have a large dataset that needs to be modified based on specific criteria.
Here is a sample of the data:
Input
   BL.DB  BL.KB  MI.RO  MI.RA  MI.XZ  MAY.BE
0      0      1      1      1      0       1
1      0      0      1      0      0       1
SampleData1 = pd.DataFrame([[0,1,1,1,1],[0,0,1,0,0]],
                           columns=['BL.DB',
                                    'BL.KB',
                                    'MI.RO',
                                    'MI.RA',
                                    'MI.XZ'])
The fields of this data are all formatted 'family.member', and a family may have any number of members. We need to remove all rows of the dataframe which have all 0's for any family.
Simply put, we want to only keep rows of the data that contain at least one member of every family.
We have no reproducible code for this problem because we are unsure of where to start.
We thought about using iterrows() but the documentation says:
#You should **never modify** something you are iterating over.
#This is not guaranteed to work in all cases. Depending on the
#data types, the iterator returns a copy and not a view, and writing
#to it will have no effect.
Other questions on S.O. do not quite solve our problem.
Here is what we want the SampleData to look like after we run it:
Expected output
   BL.DB  BL.KB  MI.RO  MI.RA  MI.XZ  MAY.BE
0      0      1      1      1      0       1
SampleData1 = pd.DataFrame([[0,1,1,1,0]],
                           columns=['BL.DB',
                                    'BL.KB',
                                    'MI.RO',
                                    'MI.RA',
                                    'MI.XZ'])
Also, could you please explain why we should not modify data we are iterating over, when we do that all the time with for loops, and what the correct way to modify a DataFrame is?
Thanks for the help in advance!
Start by copying df and converting its columns into a MultiIndex:
df2 = df.copy()
df2.columns = df.columns.str.split(r'\.', expand=True)
The result is:
  BL    MI
  DB KB RO RA XZ
0  0  1  1  1  0
1  0  0  1  0  0
To generate "family totals", i.e. sums of elements in rows over the top
(0) level of column index, run:
df2.groupby(level=[0], axis=1).sum()
The result is:
BL MI
0 1 2
1 0 1
But actually we want to count zeroes in each row of the above table,
so extend the above code to:
(df2.groupby(level=[0], axis=1).sum() == 0).astype(int).sum(axis=1)
The result is:
0 0
1 1
dtype: int64
meaning:
row with index 0 has no "family zeroes",
row with index 1 has one such zero (for one family).
And to print what we are looking for, run:
df[(df2.groupby(level=[0], axis=1).sum() == 0)\
   .astype(int).sum(axis=1) == 0]
i.e. print rows from df, with indices for which the count of
"family zeroes" in df2 is zero.
It's possible to group along axis=1. For each row, check that all families (grouped on the column name before '.') have at least one 1, then slice by this Boolean Series to retain these rows.
m = df.groupby(df.columns.str.split('.').str[0], axis=1).any(1).all(1)
df[m]
# BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
#0 0 1 1 1 0 1
As an illustration, here's what grouping along axis=1 looks like; it partitions the DataFrame by columns.
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1):
    print(idx, gp, '\n')
#BL BL.DB BL.KB
#0 0 1
#1 0 0
#MAY MAY.BE
#0 1
#1 1
#MI MI.RO MI.RA MI.XZ
#0 1 1 0
#1 1 0 0
Now it's rather straightforward to find the rows where every one of these groups has at least one non-zero column, by chaining any and all along axis=1.
You basically want to group on families and retain the rows that have at least one member of every family.
One way to do this is to transpose the original dataframe and then split the index on the period, taking the first element, which is the family identifier. The columns are the index values of the original dataframe.
We can then group on the families (level=0) and sum the number of members in each for every record (df2.groupby(level=0).sum()). Now we retain the index values with at least one member in each family (.gt(0).all()). We create a mask from these values and use it for boolean indexing on the original dataframe to get the relevant rows.
df2 = SampleData1.T
df2.index = [idx.split('.')[0] for idx in df2.index]
# >>> df2
# 0 1
# BL 0 0
# BL 1 0
# MI 1 1
# MI 1 0
# MI 0 0
# >>> df2.groupby(level=0).sum()
# 0 1
# BL 1 0
# MI 2 1
mask = df2.groupby(level=0).sum().gt(0).all()
>>> SampleData1[mask]
BL.DB BL.KB MI.RO MI.RA MI.XZ
0 0 1 1 1 0
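The transpose-and-group idea can also be written with a callable as the grouping key, which avoids groupby(..., axis=1) (deprecated in newer pandas releases). This is a sketch using the six-column input table from the question; family_totals, mask and kept are illustrative names.

```python
import pandas as pd

# The six-column input table from the question.
df = pd.DataFrame(
    [[0, 1, 1, 1, 0, 1], [0, 0, 1, 0, 0, 1]],
    columns=['BL.DB', 'BL.KB', 'MI.RO', 'MI.RA', 'MI.XZ', 'MAY.BE'],
)

# Transpose so the 'family.member' labels become the index, then group the
# rows by family; the callable receives each index label.
family_totals = df.T.groupby(lambda label: label.split('.')[0]).sum()

# One boolean per original row: True when every family total is above zero.
mask = family_totals.gt(0).all()

# Boolean indexing builds a new frame, so nothing is modified while iterating.
kept = df[mask]
print(kept)
```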
