Supposing I have two Pandas series:
s1 = pandas.Series([1,2,3])
s2 = pandas.Series([3,1,2])
Is there a good way to equate them in a column x row-style? i.e. I want a DataFrame output that is the result of doing
1 == 3, 2 == 3, 3 == 3
1 == 1, 2 == 1, 3 == 1
1 == 2, 2 == 2, 3 == 2
With the expected output of
False False True
True False False
False True False
I understand that I could expand the two Series out into DataFrames in their own right and then equate those DataFrames, but then my peak memory usage would double. I could also loop through one Series, equate each individual value to the other Series, and then stack those output Series together into a DataFrame, and I'll do that if I have to. But it feels like there should be a way to do this.
You can take advantage of broadcasting (go through to_numpy(), since positional [:, None] indexing directly on a Series is no longer supported in recent pandas):
res = s1.to_numpy()[:, None] == s2.to_numpy()[None, :]
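A self-contained sketch of the broadcasting approach, additionally wrapping the result in a DataFrame that keeps the original values as row/column labels:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([3, 1, 2])

# Insert axes so that element (i, j) compares s1[i] with s2[j]
res = s1.to_numpy()[:, None] == s2.to_numpy()[None, :]

# Optionally keep the original values as labels
out = pd.DataFrame(res, index=s1, columns=s2)
print(out)
```

Only one (n × n) boolean array is materialised, so peak memory stays well below expanding both Series into full DataFrames.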
You can do it using numpy.outer (note that this trick relies on exact floating-point division, so it breaks if s2 contains zeros):
pd.DataFrame(np.outer(s1, 1 / s2) == 1, index=s1, columns=s2)
s2 3 1 2
s1
1 False True False
2 False False True
3 True False False
Easy to do with apply
out = s1.apply(lambda x : s2==x)
Out[31]:
0 1 2
0 False True False
1 False False True
2 True False False
Related
Say I have a DataFrame. (The original DataFrame has 91 columns and 1000 rows.)
0 1 2 3
0 False False False True
1 True False False False
2 True False False False
3 False False True False
4 False True True False
5 False False False False
6 True True True True
I need to get the AND/OR values for all the columns in my dataframe. So the resultant OR, AND values would be.
OR AND
0 True False
1 True False
2 True False
3 True False
4 True False
5 False False
6 True True
I can do this by looping over all my columns and calculating the boolean value for each one, but I was looking for a more DataFrame-level approach that avoids explicitly iterating over the columns.
You can use any and all.
df = df.assign(OR=df.any(axis=1), AND=df.all(axis=1))
You can sum along the columns; OR is then indicated by sum > 0, and AND by sum == len(df.columns):
total = df.sum(axis=1)
res = pd.DataFrame({"OR": total > 0, "AND": total == len(df.columns)})
If you have many columns this is more efficient, as it iterates over the entire matrix only once (in the worst case any/all iterate twice; depending on the input distribution and implementation, short-circuiting can still make them faster).
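The two answers can be cross-checked against each other on the question's data; they should agree:

```python
import pandas as pd

df = pd.DataFrame([[False, False, False, True],
                   [True, False, False, False],
                   [True, False, False, False],
                   [False, False, True, False],
                   [False, True, True, False],
                   [False, False, False, False],
                   [True, True, True, True]])

# any/all approach: row-wise OR and AND
res1 = df.assign(OR=df.any(axis=1), AND=df.all(axis=1))[["OR", "AND"]]

# single-pass sum approach
total = df.sum(axis=1)
res2 = pd.DataFrame({"OR": total > 0, "AND": total == len(df.columns)})

print(res2)
```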
I have a dataset, which has two columns:
index Value
0 True
1 True
2 False
3 True
Is it possible to obtain a matrix that looks like
index 0 1 2 3
0 True True False True
1 True True False True
2 False False False False
3 True True False True
I tried pd.crosstab but was still not able to get the matrix. Can anyone please help?
A possible way:
m = np.tile(df['Value'], len(df)).reshape(-1, len(df)) * df[['Value']].values
out = pd.DataFrame(m)
print(out)
# Output
0 1 2 3
0 True True False True
1 True True False True
2 False False False False
3 True True False True
First, convert the values of the Value column to a NumPy array using to_numpy. Then take advantage of NumPy broadcasting by creating an extra axis with [:, None] and computing the bitwise AND operation:
vals = df['Value'].to_numpy()
res = pd.DataFrame(vals[:,None] & vals, index=df.index)
Output:
>>> res
0 1 2 3
index
0 True True False True
1 True True False True
2 False False False False
3 True True False True
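As a variation (not from the answers above), NumPy's logical ufuncs expose an .outer method that computes the same pairwise AND without manual axis insertion:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [True, True, False, True]})

vals = df['Value'].to_numpy()
# np.logical_and.outer(a, b)[i, j] == a[i] & b[j]
res = pd.DataFrame(np.logical_and.outer(vals, vals),
                   index=df.index, columns=df.index)
print(res)
```

The result is symmetric by construction, since the same vector supplies both axes.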
I would need to assign boolean values to rows in a new column Y based on the value of a column called X (1,2,3,4,5).
I have this column in a dataset df:
X
1
1
1
3
2
5
2
4
1
I would like a new one, Y, in a new dataset that is a copy of df, where:
if row has X value = 1 then True
if row has X value = 2 then False
if row has X value = 3 then False
if row has X value = 4 then True
if row has X value = 5 then False
So I should have
X Y
1 true
1 true
1 true
3 false
2 false
5 false
2 false
4 true
1 true
I wrote this code:
new_df=df.copy()
new_df['Y'] = False
for index in df.iterrows():
    if df['X'] == 1:
        new_df.iloc[index, 9] = True
    elif df['X'] == 2:
        new_df.iloc[index, 9] = False
    elif df['X'] == 3:
        new_df.iloc[index, 9] = False
    elif df['X'] == 4:
        new_df.iloc[index, 9] = True
    else:
        new_df.iloc[index, 9] = False
getting this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Can you please help me to fix the code to get the expected output? Thank you
Edit: np.where() is preferred to map()
I believe what you need to do is to create a custom function where you can use the if-elif-else and then use map with it. Something along the lines of:
def evaluator(x):
    if x == 1:
        return True
    elif x == 2:
        return False
    elif x == 3:
        return False
    elif x == 4:
        return True
    else:
        return False

df['Y'] = df['X'].map(evaluator)
Allolz's comment provides a useful simplification, which also allows a vectorized operation with np.where():
df['Y'] = np.where(df['X'].isin([1, 4]), True, False)
This, in your case and given your input dataframe, outputs:
X Y
0 1 True
1 1 True
2 1 True
3 3 False
4 2 False
5 5 False
6 2 False
7 4 True
8 1 True
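If the mapping is fixed and covers every value, Series.map also accepts a plain dict, which avoids both the helper function and np.where (a minimal sketch on the question's data):

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 1, 1, 3, 2, 5, 2, 4, 1]})

# Values missing from the dict would become NaN, so it must be exhaustive
df['Y'] = df['X'].map({1: True, 2: False, 3: False, 4: True, 5: False})
print(df)
```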
Given
patient_id test_result has_cancer
0 79452 Negative False
1 81667 Positive True
2 76297 Negative False
3 36593 Negative False
4 53717 Negative False
5 67134 Negative False
6 40436 Negative False
How do I count the False or True values in a column in Python?
I had been trying:
# number of patients with cancer
number_of_patients_with_cancer= (df["has_cancer"]==True).count()
print(number_of_patients_with_cancer)
So you need value_counts?
df.col_name.value_counts()
Out[345]:
False 6
True 1
Name: has_cancer, dtype: int64
If has_cancer has NaNs:
false_count = (~df.has_cancer).sum()
If has_cancer does not have NaNs, you can optimise by not having to negate the masks beforehand.
false_count = len(df) - df.has_cancer.sum()
And similarly, if you want just the count of True values, that is
true_count = df.has_cancer.sum()
If you want both, it is
fc, tc = df.has_cancer.value_counts().sort_index().tolist()
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
If the pandas Series above is called example:
example.sum()
This outputs 1, since there is only one True value in the Series. To get the count of False values:
len(example) - example.sum()
number_of_patients_with_cancer = df.has_cancer[df.has_cancer==True].count()
Considering your above DataFrame as df:
True_Count = df[df.has_cancer == True]
len(True_Count)
Just sum the column for a count of the True values: False is just a special case of 0 and True a special case of 1, so the False count is your row count minus that sum (unless you've got NaNs in there).
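That observation translates to a two-liner (assuming, as noted, that the column has no NaNs):

```python
import pandas as pd

df = pd.DataFrame({'has_cancer': [False, True, False, False, False, False, False]})

true_count = int(df['has_cancer'].sum())   # True counts as 1, False as 0
false_count = len(df) - true_count         # valid only without NaNs
```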
I have a list of particle pairs, within which each pair is referred to by a combination of a chain index and an intra-chain index for both particles. I have saved those in a DataFrame (let's call it index_array), and now I want to plot a matrix of all particle pairs, where every matrix element that corresponds to a pair in the list is drawn in one color and all others in another color. My idea was to produce a DataFrame (let's call it to_fill) with chain and intra-chain index as a MultiIndex for both rows and columns, so that it has two entries per pair, and then use index_array to index to_fill and change the corresponding values, after which I can plot the values of to_fill via matplotlib.pyplot.pcolormesh.
So to break it down into a more or less well-defined problem: I have a boolean DataFrame to_fill with multiindexed rows and columns (2 levels each) that contains only False values. I also have another DataFrame, index_array, with four columns containing the index values for the levels of both rows and columns. Now I want to set all elements pointed to by index_array to True. A toy version could, for example, be produced with the code below:
import numpy as np
import pandas as pd
lengths = pd.Series(data=[2, 4], index=[1, 2]) # Corresponds to the chains' lengths
index = pd.MultiIndex.from_tuples([(i, j) for i in lengths.index
for j in np.arange(1, lengths.loc[i]+1)])
# np.bool was removed from recent NumPy versions; the built-in bool works
to_fill = pd.DataFrame(False, index=index, columns=index, dtype=bool)
print(to_fill)
# 1 2
# 1 2 1 2 3 4
# 1 1 False False False False False False
# 2 False False False False False False
# 2 1 False False False False False False
# 2 False False False False False False
# 3 False False False False False False
# 4 False False False False False False
index_array = pd.DataFrame([[1, 1, 1, 1],
[1, 1, 1, 2],
[2, 3, 2, 3],
[2, 3, 2, 4]],
columns=["i_1", "j_1", "i_2", "j_2"])
print(index_array)
# i_1 j_1 i_2 j_2
# 0 1 1 1 1
# 1 1 1 1 2
# 2 2 3 2 3
# 3 2 3 2 4
Now I want to set all entries in to_fill that correspond to (i_1, j_1), (i_2, j_2) for a row in index_array to True. So basically, index_array refers to entries in to_fill that should be changed. The expected result would thus be:
print(to_fill)
# 1 2
# 1 2 1 2 3 4
# 1 1 True True False False False False
# 2 False False False False False False
# 2 1 False False False False False False
# 2 False False False False False False
# 3 False False False False True True
# 4 False False False False False False
But I did not manage to properly use index_array as an index. How can I tell to_fill to treat the indexing arrays i_1, j_1, i_2, and j_2 as corresponding index values for the levels of the row and column MultiIndex respectively?
This is a little better - hmm perhaps not really:
tuples = [tuple(x) for x in index_array.values]
stacked = to_fill.stack(level=0).stack() # double stack carefully ordered
stacked.loc[tuples] = True
result = stacked.unstack(level=2).unstack().dropna(axis=1) #unstack and drop NaN cols
This is not great, as I would rather not use iterrows() if it can be helped:
idx = pd.IndexSlice
for row in index_array.iterrows():
    r = row[1]
    i_1 = r.loc['i_1']
    j_1 = r.loc['j_1']
    i_2 = r.loc['i_2']
    j_2 = r.loc['j_2']
    to_fill.loc[idx[i_1, j_1], idx[i_2, j_2]] = True
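A loop-free alternative sketch: translate each (chain, intra-chain) label pair into positional indices with Index.get_indexer, then assign on the underlying array in one step:

```python
import numpy as np
import pandas as pd

lengths = pd.Series(data=[2, 4], index=[1, 2])
index = pd.MultiIndex.from_tuples([(i, j) for i in lengths.index
                                   for j in np.arange(1, lengths.loc[i] + 1)])
to_fill = pd.DataFrame(False, index=index, columns=index)
index_array = pd.DataFrame([[1, 1, 1, 1],
                            [1, 1, 1, 2],
                            [2, 3, 2, 3],
                            [2, 3, 2, 4]],
                           columns=["i_1", "j_1", "i_2", "j_2"])

# Map each (i, j) label pair to its integer position in the MultiIndex
rows = to_fill.index.get_indexer(list(zip(index_array['i_1'], index_array['j_1'])))
cols = to_fill.columns.get_indexer(list(zip(index_array['i_2'], index_array['j_2'])))

# Assign on a plain NumPy array, then rebuild the labelled DataFrame
arr = to_fill.to_numpy().copy()
arr[rows, cols] = True
to_fill = pd.DataFrame(arr, index=to_fill.index, columns=to_fill.columns)
print(to_fill)
```

This touches each listed pair exactly once and avoids iterrows() entirely.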