Given the following DataFrame:
patient_id test_result has_cancer
0 79452 Negative False
1 81667 Positive True
2 76297 Negative False
3 36593 Negative False
4 53717 Negative False
5 67134 Negative False
6 40436 Negative False
How do I count the False or True values in a column, in Python?
I have been trying:
# number of patients with cancer
number_of_patients_with_cancer= (df["has_cancer"]==True).count()
print(number_of_patients_with_cancer)
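Note: .count() counts every non-null entry of the boolean mask, True and False alike, which is why it returns the full row count (7) instead of 1. Use .sum() to count only the Trues:
(df["has_cancer"] == True).count()  # 7: all non-null entries of the mask
(df["has_cancer"] == True).sum()    # 1: True counts as 1, False as 0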
Sounds like you need value_counts:
df.has_cancer.value_counts()
Out[345]:
False 6
True 1
Name: has_cancer, dtype: int64
If has_cancer has NaNs:
false_count = (~df.has_cancer).sum()
If has_cancer does not have NaNs, you can optimise by skipping the negation:
false_count = len(df) - df.has_cancer.sum()
And similarly, if you want just the count of True values, that is
true_count = df.has_cancer.sum()
If you want both, it is
fc, tc = df.has_cancer.value_counts().sort_index().tolist()
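Note that unpacking value_counts like this assumes both True and False actually occur in the column; if one might be missing entirely, reindexing first is safer (a minimal sketch):
counts = df.has_cancer.value_counts().reindex([False, True], fill_value=0)
fc, tc = counts.tolist()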
Suppose you have a pandas Series like this:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
If the pandas Series above is called example:
example.sum()
This outputs 1, since there is only one True value in the series. To get the count of False values:
len(example) - example.sum()
Alternatively, filter first and then count:
number_of_patients_with_cancer = df.has_cancer[df.has_cancer==True].count()
Considering your above DataFrame as df:
True_Count = df[df.has_cancer == True]
len(True_Count)
Just sum the column for a count of the Trues; False is just a special case of 0 and True a special case of 1. The False count would then be your row count minus the True count, unless you've got NaNs in there.
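A minimal sketch of that idea, using the df above:
true_count = df["has_cancer"].sum()               # True counts as 1, False as 0 -> 1
false_count = len(df) - true_count                # 7 - 1 = 6, assuming no NaNs
false_count = (df["has_cancer"] == False).sum()   # NaN-safe: counts only actual Falses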
Related
Say I have a DataFrame (the original DataFrame has 91 columns and 1000 rows):
0 1 2 3
0 False False False True
1 True False False False
2 True False False False
3 False False True False
4 False True True False
5 False False False False
6 True True True True
I need to get the AND/OR values for all the columns in my DataFrame, so the resultant OR and AND values would be:
OR AND
0 True False
1 True False
2 True False
3 True False
4 True False
5 False False
6 True True
I can do this by looping over all my columns and calculating the boolean for each column, but I was looking for a more DataFrame-level approach that does not actually go through the columns one by one.
You can use any and all with axis=1:
df = df.assign(OR=df.any(axis=1), AND=df.all(axis=1))
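For the sample frame this produces the expected columns:
print(df[['OR', 'AND']])
#       OR    AND
# 0   True  False
# 1   True  False
# 2   True  False
# 3   True  False
# 4   True  False
# 5  False  False
# 6   True   True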
You can sum along the columns and then the OR is indicated by sum > 0, and AND is indicated by sum == len(df.columns):
total = df.sum(axis=1)
res = pd.DataFrame({"OR": total > 0, "AND": total == len(df.columns)})
If you have many columns this can be more efficient, since it iterates over the entire matrix only once (although, depending on the input distribution and the implementation of any/all, short-circuiting can make two passes faster).
I have a dataset, which has two columns:
index Value
0 True
1 True
2 False
3 True
Is it possible to obtain a matrix that looks like
index 0 1 2 3
0 True True False True
1 True True False True
2 False False False False
3 True True False True
I tried pd.crosstab but was still not able to get the matrix. Can anyone please help?
A possible way:
# repeat the values as rows, then multiply by the column vector
# (multiplication acts as AND on booleans)
m = np.tile(df['Value'], len(df)).reshape(-1, len(df)) * df[['Value']].values
out = pd.DataFrame(m)
print(out)
# Output
0 1 2 3
0 True True False True
1 True True False True
2 False False False False
3 True True False True
First, convert the values of the Value column to a NumPy array using to_numpy. Then take advantage of NumPy broadcasting by creating an extra axis with [:,None] and computing the bitwise AND (&):
vals = df['Value'].to_numpy()
res = pd.DataFrame(vals[:,None] & vals, index=df.index)
Output:
>>> res
0 1 2 3
index
0 True True False True
1 True True False True
2 False False False False
3 True True False True
Supposing I have two Pandas series:
s1 = pandas.Series([1,2,3])
s2 = pandas.Series([3,1,2])
Is there a good way to equate them in a column x row-style? i.e. I want a DataFrame output that is the result of doing
1 == 3, 2 == 3, 3 == 3
1 == 1, 2 == 1, 3 == 1
1 == 2, 2 == 2, 3 == 2
With the expected output of
False False True
True False False
False True False
I understand that I could expand the two series out into DataFrames in their own right and then equate those DataFrames, but then my peak memory usage would double. I could also loop through one series, equate each individual value to the other series, and then stack those output series together into a DataFrame, and I'll do that if I have to. But it feels like there should be a way to do this directly.
You can take advantage of broadcasting. Note that multi-dimensional indexing like s1[:, None] is no longer supported on a Series, so convert to NumPy arrays first:
res = s1.to_numpy()[:, None] == s2.to_numpy()[None, :]
This puts s1 along the rows; swap s1 and s2 (or transpose) if you want rows to follow s2 as in the expected output.
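To get a labeled DataFrame in the orientation of the expected output (rows following s2), a minimal sketch:
res = pd.DataFrame(s2.to_numpy()[:, None] == s1.to_numpy()[None, :],
                   index=s2.index, columns=s1.index)
#        0      1      2
# 0  False  False   True
# 1   True  False  False
# 2  False   True  False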
You can do it using numpy.outer (note this relies on the floating-point product s1 * (1/s2) coming out exactly 1, and it breaks if s2 contains zeros, so the broadcasting approach is more robust):
pd.DataFrame(np.outer(s1,1/s2) == 1, s1, s2)
s2 3 1 2
s1
1 False True False
2 False False True
3 True False False
This is easy to do with apply:
out = s1.apply(lambda x : s2==x)
Out[31]:
0 1 2
0 False True False
1 False False True
2 True False False
I have a Pandas DataFrame and I want to find all rows where the i'th column values are 10 times greater than other columns.
Here is an example: looking at column i=0 of my DataFrame (reproduced as code in the answer below), the value in row B (0.344955) is 10x greater than the values in the same row but in the other columns (0.001077, 0, 0.009359, 0). So I would like:
my_list_0=[False,True,False,False,False,False,False,False,False,False,False]
The number of columns might change hence I don't want a solution like:
#This is good only for a DataFrame with 4 columns.
my_list_i = data.loc[(data.iloc[:,i]>10*data.iloc[:,(i+1)%num_cols]) &
(data.iloc[:,i]>10*data.iloc[:,(i+2)%num_cols]) &
(data.iloc[:,i]>10*data.iloc[:,(i+3)%num_cols])]
Any ideas? Thanks.
Given the df:
df = pd.DataFrame({'cell1':[0.006209, 0.344955, 0.004521, 0, 0.018931, 0.439725, 0.013195, 0.009045, 0, 0.02614, 0],
'cell2':[0.048043, 0.001077, 0,0.010393, 0.031546, 0.287264, 0.016732, 0.030291, 0.016236, 0.310639,0],
'cell3':[0,0,0.020238, 0, 0.03811, 0.579348, 0.005906, 0,0,0.068352, 0.030165],
'cell4':[0.016139, 0.009359, 0,0,0.025449, 0.47779, 0, 0.01282, 0.005107, 0.004846, 0],
'cell5': [0,0,0,0.012075, 0.031668, 0.520258, 0,0,0,2.728218, 0.013418]})
i = 0
You can use
(10 * df.drop(df.columns[i], axis=1)).lt(df.iloc[:,i], axis=0).all(1)
To get
0 False
1 True
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
dtype: bool
for any number of columns. This drops column i, multiplies the remaining DataFrame by 10, and checks row-wise whether every value is less than the corresponding value in column i, returning True only if all values in a row pass. So it returns a boolean vector with True for each row where this holds and False for the others.
If you want to give an arbitrary threshold, you can sum the Trues and divide by the number of columns - 1, then compare with your threshold:
thresh = 0.5 # or whatever you want
(10 * df.drop(df.columns[i], axis=1)).lt(df.iloc[:,i], axis=0).sum(1) / (df.shape[1] - 1) > thresh
0 False
1 True
2 True
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
dtype: bool
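Since the number of columns can change, the same idea extends to every column at once with a dict comprehension (a sketch, assuming the df above; all_masks is just an illustrative name):
# one boolean Series per column i: True where column i is 10x greater than all others
all_masks = pd.DataFrame({
    i: (10 * df.drop(df.columns[i], axis=1)).lt(df.iloc[:, i], axis=0).all(1)
    for i in range(df.shape[1])
})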
I have a list of particle pairs, within which each pair is referred to by a combination of a chain index and an intra-chain index for both particles. I have saved those in a DataFrame (let's call it index_array) and now I want to plot a matrix of all particle pairs, where every matrix element that corresponds to a pair in the list is drawn in one color and all others in another color.
My idea was thus to produce a DataFrame (let's call it to_fill) with chain and intra-chain index as a MultiIndex for both rows and columns, which thus has two entries per pair, and then use index_array to index to_fill and change the corresponding values, so that I can plot the values of to_fill via matplotlib.pyplot.pcolormesh.
So to break it down into a more or less well-defined problem: I have a boolean DataFrame to_fill that has multiindexed rows and columns (2 levels each) that contains only Falses. I also have another DataFrame index_array that has four columns, containing the index values for the levels of both rows and columns. Now I want to set all elements pointed to by index_array to True. A toy version of those could for example be produced with the code below:
import numpy as np
import pandas as pd
lengths = pd.Series(data=[2, 4], index=[1, 2]) # Corresponds to the chains' lengths
index = pd.MultiIndex.from_tuples([(i, j) for i in lengths.index
for j in np.arange(1, lengths.loc[i]+1)])
# np.bool was removed from NumPy; build the all-False bool frame directly
to_fill = pd.DataFrame(False, index=index, columns=index)
print(to_fill)
# 1 2
# 1 2 1 2 3 4
# 1 1 False False False False False False
# 2 False False False False False False
# 2 1 False False False False False False
# 2 False False False False False False
# 3 False False False False False False
# 4 False False False False False False
index_array = pd.DataFrame([[1, 1, 1, 1],
[1, 1, 1, 2],
[2, 3, 2, 3],
[2, 3, 2, 4]],
columns=["i_1", "j_1", "i_2", "j_2"])
print(index_array)
# i_1 j_1 i_2 j_2
# 0 1 1 1 1
# 1 1 1 1 2
# 2 2 3 2 3
# 3 2 3 2 4
Now I want to set all entries in to_fill that correspond to (i_1, j_1), (i_2, j_2) for a row in index_array to True. So basically, index_array refers to entries in to_fill that should be changed. The expected result would thus be:
print(to_fill)
# 1 2
# 1 2 1 2 3 4
# 1 1 True True False False False False
# 2 False False False False False False
# 2 1 False False False False False False
# 2 False False False False False False
# 3 False False False False True True
# 4 False False False False False False
But I did not manage to properly use index_array as an index. How can I tell to_fill to treat the indexing arrays i_1, j_1, i_2, and j_2 as corresponding index values for the levels of the row and column MultiIndex respectively?
This is a little better (though perhaps not by much):
tuples = [tuple(x) for x in index_array.values]
stacked = to_fill.stack(level=0).stack() # double stack carefully ordered
stacked.loc[tuples] = True
result = stacked.unstack(level=2).unstack().dropna(axis=1) #unstack and drop NaN cols
This is not great, as I would rather not use iterrows() if it can be helped:
idx = pd.IndexSlice
for _, r in index_array.iterrows():
    # row label (i_1, j_1), column label (i_2, j_2)
    to_fill.loc[idx[r['i_1'], r['j_1']], idx[r['i_2'], r['j_2']]] = True
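If you want to avoid iterrows() entirely, one vectorized alternative (a sketch, assuming the toy frames above) is to translate the label pairs into positions with get_indexer and set them in the underlying array in one shot:
row_pos = to_fill.index.get_indexer(list(zip(index_array['i_1'], index_array['j_1'])))
col_pos = to_fill.columns.get_indexer(list(zip(index_array['i_2'], index_array['j_2'])))
arr = to_fill.to_numpy().copy()   # copy so we don't rely on .to_numpy() returning a view
arr[row_pos, col_pos] = True
to_fill = pd.DataFrame(arr, index=to_fill.index, columns=to_fill.columns)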