Sum different rows in a data frame based on multiple conditions - python

I have the following data frame:
dataset = {
'String': ["ABCDEF","HIJABC","ABCHIJ","DEFABC"],
'Bool':[True,True,False,False],
'Number':[10,20,40,50]}
df = pd.DataFrame(dataset)
String Bool Number
0 ABCDEF True 10
1 HIJABC True 20
2 ABCHIJ False 40
3 DEFABC False 50
I would like to add the Number of each row where Bool is False to the Number of a row where Bool is True.
Two rows can be matched and summed together if the reverse of the String of one row is not equal to the String of the other row.
In this case ABCHIJ, where Bool is False, is not equal to the reverse of ABCDEF, so I sum the numbers: 10 + 40 = 50.
HIJABC, where Bool is True, is summed with DEFABC, where Bool is False, and the outcome is 20 + 50 = 70.
Expected output:
String Bool Number
0 ABCDEF True 50
1 HIJABC True 70
2 ABCHIJ False 40
3 DEFABC False 50
I hope my explanation was good enough. Is there a way to achieve the above outcome?

One way is like this:
df_true = df[df['Bool']]
df_false = df[~df['Bool']]
for i in df_false['String']:
    # index of the first True row whose String differs from i with its halves swapped
    idx = df_true[df_true['String'] != (i[3:] + i[:3])].index[0]
    current_num = df.loc[idx, 'Number']
    added_num = df.loc[df['String'] == i, 'Number'].values[0]
    df.loc[idx, 'Number'] = current_num + added_num
I hope it helps
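For reference, here is a self-contained sketch of the same loop run on the question's data. It assumes, as the answer above does, that "reverse" means swapping the two three-character halves of String, and that each False row is added to the first True row whose String differs from that swapped value. The function name add_unmatched is hypothetical, and the function works on a copy rather than modifying df in place.

```python
import pandas as pd


def add_unmatched(df: pd.DataFrame) -> pd.DataFrame:
    """Add each False row's Number to the first non-matching True row (hypothetical helper)."""
    out = df.copy()
    true_rows = out[out["Bool"]]
    false_rows = out[~out["Bool"]]
    for _, row in false_rows.iterrows():
        # Assumption: "reverse" means swapping the two 3-character halves.
        swapped = row["String"][3:] + row["String"][:3]
        candidates = true_rows.index[true_rows["String"] != swapped]
        if len(candidates) == 0:
            continue  # no True row qualifies, leave everything unchanged
        out.loc[candidates[0], "Number"] += row["Number"]
    return out


df = pd.DataFrame({
    "String": ["ABCDEF", "HIJABC", "ABCHIJ", "DEFABC"],
    "Bool": [True, True, False, False],
    "Number": [10, 20, 40, 50],
})
result = add_unmatched(df)
print(result)
```

On the sample data the Number column of result becomes 50, 70, 40, 50, which matches the expected output in the question.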

Related

Compare Python dataframe columns to variables containing None values

I'm comparing a dataframe to multiple variables as follows...
if not ((df['Column1'] == variableA) & (df['Column2'] == variableB) & (df['Column3'] == variableC)).any():
    print("row not in dataframe")
Both my dataframe and my variables may contain a None value, but I've noticed that when two None values are compared the result is not True, so I end up printing "row not in dataframe" even though the df and the variables hold the same (None) value.
Any thoughts on the best way around this issue would be really appreciated. Perhaps I have to convert the None values to a string that will return True when compared?
The reason is that pandas treats None as a missing value, and an element-wise == comparison against a missing value returns False, even when both sides are None.
Consider the following example:
s = pd.Series([1, "a", None])
s == 1
Out[4]:
0 True
1 False
2 False
dtype: bool
s == "a"
Out[5]:
0 False
1 True
2 False
dtype: bool
s == None
Out[6]:
0 False
1 False
2 False
dtype: bool
s.isna()
Out[7]:
0 False
1 False
2 True
dtype: bool
So if you want two Nones to count as equal, you need to check whether the values are NA.
If you have two Series (e.g. two columns of a dataframe), you need to combine both results with |:
d = pd.Series([2, "a", None])
s == d
Out[12]:
0 False
1 True
2 False
dtype: bool
(s == d) | (s.isna() & d.isna())
Out[13]:
0 False
1 True
2 True
dtype: bool
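This means that the solution for your problem would be something like the following, where each column counts as a match if the values are equal or if both the cell and the variable are missing:
(
    ((df['Column1'] == variableA) | (df['Column1'].isna() & pd.isna(variableA)))
    & ((df['Column2'] == variableB) | (df['Column2'].isna() & pd.isna(variableB)))
    & ((df['Column3'] == variableC) | (df['Column3'].isna() & pd.isna(variableC)))
)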
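A per-column version of this check can be sketched as a small helper, so that a missing cell matches only when the variable compared against that column is missing too. The names eq_or_both_missing and row_mask are hypothetical, the sample data is made up for illustration, and the variables are assumed to be scalars.

```python
import pandas as pd


def eq_or_both_missing(column: pd.Series, value) -> pd.Series:
    """Element-wise equality that also matches a missing cell with a missing scalar."""
    if pd.isna(value):  # assumption: value is a scalar, possibly None/NaN
        return column.isna()
    return column == value


def row_mask(df: pd.DataFrame, a, b, c) -> pd.Series:
    """True for rows where all three columns match (hypothetical helper)."""
    return (
        eq_or_both_missing(df["Column1"], a)
        & eq_or_both_missing(df["Column2"], b)
        & eq_or_both_missing(df["Column3"], c)
    )


# Made-up sample data with missing values in every column.
df = pd.DataFrame({
    "Column1": [1, None],
    "Column2": ["x", None],
    "Column3": [None, "z"],
})

if not row_mask(df, 1, "x", None).any():
    print("row not in dataframe")
else:
    print("row found")
```

With this sample, (1, "x", None) matches the first row and (None, None, "z") matches the second, while (2, "x", None) matches nothing.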

Find empty cells in rows of a column | Dataframe pandas

My DataFrame has a column named "Teacher" and I want to know which rows in that column are empty.
Example:
print(df["Teacher"])
0
1
2 Richard
3
4 Richard
Name: Teacher, Length: 5, dtype: object
I know that if I do something like this:
R = ['R']
cond = df['Teacher'].str.startswith(tuple(R))
print(cond)
It prints the rows of that column and tells me, as a boolean, whether the teacher's name starts with R.
print(cond)
0 False
1 False
2 True
3 False
4 True
Name: Teacher, Length: 5, dtype: object
I want the same for the empty ones: return True when the cell is empty and False when it is not, but I don't know how.
If empty means a missing value (None or NaN), use Series.isna:
cond = df['Teacher'].isna()
If empty means zero or more whitespace characters, use Series.str.contains:
cond = df['Teacher'].str.contains(r'^\s*$', na=False)
If empty means an empty string, compare with it:
cond = df['Teacher'] == ''
Sample:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Teacher':['',' ', None, np.nan, 'Richard']})
cond1 = df['Teacher'].isna()
cond2 = df['Teacher'].str.contains(r'^\s*$', na=False)
cond3 = df['Teacher'] == ''
df = df.assign(cond1= cond1, cond2= cond2, cond3= cond3)
print (df)
   Teacher  cond1  cond2  cond3
0           False   True   True
1           False   True  False
2     None   True  False  False
3      NaN   True  False  False
4  Richard  False  False  False
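If "empty" should cover all of these cases at once, the conditions can be combined into a single mask. This is only a sketch on the same sample data; the name is_blank is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Teacher": ["", " ", None, np.nan, "Richard"]})
teacher = df["Teacher"]

# Missing values OR strings made only of whitespace (this includes the empty string).
is_blank = teacher.isna() | teacher.str.contains(r"^\s*$", na=False)
print(is_blank.tolist())
```

The regular expression already matches the empty string, so the separate comparison with '' is not needed in the combined mask.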

Pandas dataframe check if a value exists in multiple columns for one row

I want to print out the row where the value is "True" for more than one column.
For example if data frame is the following:
Remove Ignore Repair
0 True False False
1 False True True
2 False True False
I want it to print:
1
Is there an elegant way to do this instead of bunch of if statements?
You can use sum with axis=1 to count the True values in each row:
import pandas as pd
df = pd.DataFrame({'a':[False, True, False],'b':[False, True, False], 'c':[True, False, False,]})
print(df)
print("Ans: ",df[(df.sum(axis=1)>1)].index.tolist())
output:
a b c
0 False False True
1 True True False
2 False False False
Ans: [1]
To get the first row that meets the criteria:
df.index[df.sum(axis=1).gt(1)][0]
Output:
Out[14]: 1
Since you can get multiple matches, you can drop the [0] to get all the rows that meet your criteria.
You can just sum booleans as they will be interpreted as True=1, False=0:
df.sum(axis=1) > 1
So to filter to rows where this evaluates as True:
df.loc[df.sum(axis=1) > 1]
Or the same thing but being more explicit about converting the booleans to integers:
df.loc[df.astype(int).sum(axis=1) > 1]
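Applied to the frame from the question, the same idea can be sketched as follows; the variable names true_counts and rows are illustrative.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "Remove": [True, False, False],
    "Ignore": [False, True, True],
    "Repair": [False, True, False],
})

# Number of True values in each row (True counts as 1, False as 0).
true_counts = df.sum(axis=1)

# Index labels of the rows with more than one True.
rows = df.index[true_counts > 1].tolist()
print(rows)  # [1]
```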

How to assert that a pandas data frame filtered on a condition is true

So I have a pytest testing the results of a query that returns pandas dataframe.
I want to assert that a particular column col has all the values that are a substring of a given input.
So this below gives me the rows (dataframe) that have that column's col value containing some input part. How can I assert it to be true?
assert result_df[result_df['col'].astype(str).str.contains(category)].bool == True
doesn't work
Try this:
assert result_df['col'].astype(str).str.contains(category).all()
Please refer to the pandas docs for more info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.all.html
The reason your code doesn't work is that .bool is a method, so .bool == True compares the method object itself with True instead of testing the values in the result.
I believe you need Series.all to check whether all values of the boolean Series are True:
assert result_df['col'].astype(str).str.contains(category).all()
Sample:
result_df = pd.DataFrame({
'col':list('aaabbb')
})
print (result_df)
col
0 a
1 a
2 a
3 b
4 b
5 b
category = 'b'
assert result_df['col'].astype(str).str.contains(category).all()
AssertionError
Detail:
print (result_df['col'].astype(str).str.contains(category))
0 False
1 False
2 False
3 True
4 True
5 True
Name: col, dtype: bool
print (result_df['col'].astype(str).str.contains(category).all())
False
category = 'a|b'
assert result_df['col'].astype(str).str.contains(category).all()
print (result_df['col'].astype(str).str.contains(category))
0 True
1 True
2 True
3 True
4 True
5 True
Name: col, dtype: bool
print (result_df['col'].astype(str).str.contains(category).all())
True
Found it. assert result_df['col'].astype(str).str.contains(category).all() works (thanks to @jezrael for suggesting all).
Note that the parentheses matter: without them, .all (like .bool) is just a bound method, which is always truthy, so the assertion can never fail.
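The difference between calling all() and merely referencing it can be checked directly. The sketch below reuses the sample data from the answer above; the variable names are illustrative.

```python
import pandas as pd

result_df = pd.DataFrame({"col": list("aaabbb")})
category = "b"
mask = result_df["col"].astype(str).str.contains(category)

# Referencing the method without calling it yields a bound method object,
# and any such object is truthy, so `assert mask.all` would always pass.
method_is_truthy = bool(mask.all)

# Calling the method actually inspects the values.
all_match = bool(mask.all())

print(method_is_truthy, all_match)  # True False
```

In a pytest test, only the called form fails when some value in col does not contain category.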

Extracting all rows from pandas Dataframe that have certain value in a specific column

I am relatively new to Python/Pandas and am struggling with extracting the correct data from a pd.DataFrame. What I actually have is a DataFrame with 3 columns:
data =
Position Letter Value
1 a TRUE
2 f FALSE
3 c TRUE
4 d TRUE
5 k FALSE
What I want to do is put all of the TRUE rows into a new DataFrame so that the answer would be:
answer =
Position Letter Value
1 a TRUE
3 c TRUE
4 d TRUE
I know that you can access a particular column using
data['Value']
but how do I extract all of the TRUE rows?
Thanks for any help and advice,
Alex
You can test which Values are True:
In [11]: data['Value'] == True
Out[11]:
0 True
1 False
2 True
3 True
4 False
Name: Value, dtype: bool
and then use boolean indexing to pull out those rows:
In [12]: data[data['Value'] == True]
Out[12]:
Position Letter Value
0 1 a True
2 3 c True
3 4 d True
Note: if the values are actually the strings 'TRUE' and 'FALSE' (they probably shouldn't be!) then use:
data['Value'] == 'TRUE'
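If the column does hold strings, one option is to convert it to real booleans once and then filter as above. This is a sketch that assumes the only values are exactly 'TRUE' and 'FALSE'; any other value would become NaN under map. The sample frame is a shortened version of the one in the question.

```python
import pandas as pd

# Shortened sample where Value holds strings instead of booleans.
data = pd.DataFrame({
    "Position": [1, 2, 3],
    "Letter": ["a", "f", "c"],
    "Value": ["TRUE", "FALSE", "TRUE"],
})

# Assumption: the column contains only 'TRUE' and 'FALSE'.
data["Value"] = data["Value"].map({"TRUE": True, "FALSE": False})

answer = data[data["Value"]]
print(answer)
```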
You can wrap your value/values in a list and do the following:
new_df = df.loc[df['yourColumnName'].isin(['your', 'list', 'items'])]
This will return a new dataframe consisting of the rows of df whose value in that column is one of the items in your list.
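On the data from the question, both approaches can be sketched like this. The frame is rebuilt here with real booleans in Value, and the letters passed to isin are chosen only for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "Position": [1, 2, 3, 4, 5],
    "Letter": ["a", "f", "c", "d", "k"],
    "Value": [True, False, True, True, False],
})

# A boolean column can be used directly as the row filter.
answer = data[data["Value"]]

# isin keeps the rows whose Letter is one of the listed items.
by_letter = data.loc[data["Letter"].isin(["a", "k"])]

print(answer)
print(by_letter)
```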
