Equate a column vector and a row vector with Pandas - python

Supposing I have two Pandas series:
s1 = pandas.Series([1,2,3])
s2 = pandas.Series([3,1,2])
Is there a good way to equate them in a column x row-style? i.e. I want a DataFrame output that is the result of doing
1 == 3, 2 == 3, 3 == 3
1 == 1, 2 == 1, 3 == 1
1 == 2, 2 == 2, 3 == 2
With the expected output of
False False True
True False False
False True False
I understand that I could expand the two series out into DataFrames in their own right and then equate those DataFrames, but then my peak memory usage will double. I could also loop through one series, equate each individual value to the other series, and then stack those output series together into a DataFrame, and I'll do that if I have to. But it feels like there should be a way to do this.

You can take advantage of NumPy broadcasting. Compare the underlying arrays, since multi-dimensional indexing directly on a Series is no longer supported in recent pandas:
res = s1.to_numpy()[:, None] == s2.to_numpy()[None, :]
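This gives a plain NumPy array. To get a labelled DataFrame like the other answers show, you can wrap the broadcast comparison yourself; a minimal, self-contained sketch:
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([3, 1, 2])

# Rows are labelled by s1's values and columns by s2's values,
# so cell (i, j) answers "is s1[i] equal to s2[j]?".
res = pd.DataFrame(s1.to_numpy()[:, None] == s2.to_numpy()[None, :],
                   index=s1, columns=s2)
print(res)
#        3      1      2
# 1  False   True  False
# 2  False  False   True
# 3   True  False  False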

You can do it using numpy.outer (note that this relies on floating-point division being exact, so it only works for non-zero values that divide cleanly):
pd.DataFrame(np.outer(s1, 1 / s2) == 1, s1, s2)
s2 3 1 2
s1
1 False True False
2 False False True
3 True False False

This is easy to do with apply:
out = s1.apply(lambda x: s2 == x)
Out[31]:
0 1 2
0 False True False
1 False False True
2 True False False

Related

Pandas : Get binary OR/AND for all the columns in a dataframe

Say I have a dataframe (the original dataframe has 91 columns and 1000 rows):
0 1 2 3
0 False False False True
1 True False False False
2 True False False False
3 False False True False
4 False True True False
5 False False False False
6 True True True True
I need to get the AND/OR values for all the columns in my dataframe. So the resultant OR, AND values would be.
OR AND
0 True False
1 True False
2 True False
3 True False
4 True False
5 False False
6 True True
I can do this by looping over all my columns and calculating the boolean for each column, but I was looking for a more DataFrame-level approach that does not explicitly go through the columns.
You can use any and all.
df = df.assign(OR=df.any(axis=1), AND=df.all(axis=1))
You can sum along the columns and then the OR is indicated by sum > 0, and AND is indicated by sum == len(df.columns):
total = df.sum(axis=1)
res = pd.DataFrame({"OR": total > 0, "AND": total == len(df.columns)})
If you have many columns this can be more efficient, since it iterates over the entire matrix only once (although, depending on the input distribution and on how any/all are implemented, iterating twice can still be faster).
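On the toy frame from the question, both approaches give the expected OR/AND columns; a quick sketch to check (the frame is rebuilt here column by column from the table above):
import pandas as pd

df = pd.DataFrame({0: [False, True, True, False, False, False, True],
                   1: [False, False, False, False, True, False, True],
                   2: [False, False, False, True, True, False, True],
                   3: [True, False, False, False, False, False, True]})

# Approach 1: any/all directly.
res1 = df.assign(OR=df.any(axis=1), AND=df.all(axis=1))[['OR', 'AND']]

# Approach 2: one row-wise sum, then derive both columns from it.
total = df.sum(axis=1)
res2 = pd.DataFrame({"OR": total > 0, "AND": total == len(df.columns)})

print(res1.equals(res2))  # True
print(res2)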

How to make a matrix using index and the value in python?

I have a dataset, which has two columns:
index Value
0 True
1 True
2 False
3 True
Is it possible to obtain a matrix that looks like
index 0 1 2 3
0 True True False True
1 True True False True
2 False False False False
3 True True False True
I tried pd.crosstab but was still not able to get the matrix. Can anyone please help?
A possible way:
m = np.tile(df['Value'], len(df)).reshape(-1, len(df)) * df[['Value']].values
out = pd.DataFrame(m)
print(out)
# Output
0 1 2 3
0 True True False True
1 True True False True
2 False False False False
3 True True False True
First, convert the values of the Value column to a NumPy array using to_numpy. Then take advantage of NumPy broadcasting by creating an extra axis with [:, None] and computing the bitwise AND operation:
vals = df['Value'].to_numpy()
res = pd.DataFrame(vals[:,None] & vals, index=df.index)
Output:
>>> res
0 1 2 3
index
0 True True False True
1 True True False True
2 False False False False
3 True True False True

Assigning boolean value to a new column based on conditions

I would need to assign boolean values to rows in a new column Y based on the value of a column called X (1,2,3,4,5).
I have this column in a dataset df:
X
1
1
1
3
2
5
2
4
1
I would like a new one, Y, in a new dataset that is a copy of df, where:
if row has X value = 1 then True
if row has X value = 2 then False
if row has X value = 3 then False
if row has X value = 4 then True
if row has X value = 5 then False
So I should have
X Y
1 true
1 true
1 true
3 false
2 false
5 false
2 false
4 true
1 true
I wrote this code:
new_df=df.copy()
new_df['Y'] = False
for index in df.iterrows():
    if df['X'] == 1:
        new_df.iloc[index, 9] = True
    elif df['X'] == 2:
        new_df.iloc[index, 9] = False
    elif df['X'] == 3:
        new_df.iloc[index, 9] = False
    elif df['X'] == 4:
        new_df.iloc[index, 9] = True
    else:
        new_df.iloc[index, 9] = False
getting this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Can you please help me to fix the code to get the expected output? Thank you
Edit: np.where() is preferred to map()
I believe what you need to do is to create a custom function where you can use the if-elif-else and then use map with it. Something along the lines of:
def evaluator(x):
    if x == 1:
        return True
    elif x == 2:
        return False
    elif x == 3:
        return False
    elif x == 4:
        return True
    else:
        return False

df['Y'] = df['X'].map(evaluator)
# Allolz's comment provides a useful simplification, which also allows for a vectorized operation with np.where()
df['Y'] = np.where(df['X'].isin([1, 4]), True, False)
This, in your case and given your input dataframe, outputs:
X Y
0 1 True
1 1 True
2 1 True
3 3 False
4 2 False
5 5 False
6 2 False
7 4 True
8 1 True
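Another way to write the same mapping (a sketch, not part of the original answer; the lookup dict and new_df names are only illustrative) is to pass a dict to Series.map:
import pandas as pd

df = pd.DataFrame({'X': [1, 1, 1, 3, 2, 5, 2, 4, 1]})

# Each X value is looked up in the dict; values missing from the dict would map to NaN.
lookup = {1: True, 2: False, 3: False, 4: True, 5: False}
new_df = df.copy()
new_df['Y'] = new_df['X'].map(lookup)
print(new_df)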

Count occurrences of False or True in a column in pandas

given
patient_id test_result has_cancer
0 79452 Negative False
1 81667 Positive True
2 76297 Negative False
3 36593 Negative False
4 53717 Negative False
5 67134 Negative False
6 40436 Negative False
How do I count the False or True values in a column in Python?
I had been trying:
# number of patients with cancer
number_of_patients_with_cancer= (df["has_cancer"]==True).count()
print(number_of_patients_with_cancer)
So you need value_counts?
df.col_name.value_counts()
Out[345]:
False 6
True 1
Name: has_cancer, dtype: int64
If has_cancer has NaNs:
false_count = (~df.has_cancer).sum()
If has_cancer does not have NaNs, you can optimise by skipping the negation:
false_count = len(df) - df.has_cancer.sum()
And similarly, if you want just the count of True values, that is
true_count = df.has_cancer.sum()
If you want both (assuming both True and False actually appear in the column), it is
fc, tc = df.has_cancer.value_counts().sort_index().tolist()
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
If the pandas Series above is called example, then
example.sum()
outputs 1, since there is only one True value in the series. To get the count of False values:
len(example) - example.sum()
number_of_patients_with_cancer = df.has_cancer[df.has_cancer==True].count()
Considering your above data frame as df:
True_Count = df[df.has_cancer == True]
len(True_Count)
Just sum the column for a count of the Trues. False is just a special case of 0 and True a special case of 1. The False count would be your row count minus that. Unless you've got na's in there.
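As a quick sanity check of the counting approaches on the column from the question (a minimal sketch, with the has_cancer values copied from the table above):
import pandas as pd

has_cancer = pd.Series([False, True, False, False, False, False, False])

true_count = has_cancer.sum()               # True counts as 1, so this is 1
false_count = len(has_cancer) - true_count  # 7 rows minus 1 True -> 6
print(has_cancer.value_counts())            # False 6, True 1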

Index DataFrame with MultiIndex Rows and Columns via another DataFrame containing row and column indices as columns

I have a list of particle pairs, where each pair is referred to by a combination of a chain index and an intra-chain index for both particles. I have saved those in a DataFrame (let's call it index_array), and now I want to plot a matrix of all particle pairs, where every matrix element that corresponds to a pair in the list gets one color and all others get another color. My idea was therefore to produce a DataFrame (let's call it to_fill) with the chain index and intra-chain index as a MultiIndex for both rows and columns, so that it has two entries per pair, and then use index_array to index into to_fill and change the corresponding values, so that I can plot the values of to_fill via matplotlib.pyplot.pcolormesh.
So to break it down into a more or less well-defined problem: I have a boolean DataFrame to_fill with multi-indexed rows and columns (two levels each) that contains only False values. I also have another DataFrame index_array with four columns, containing the index values for the levels of both rows and columns. Now I want to set all elements pointed to by index_array to True. Toy versions of these could, for example, be produced with the code below:
import numpy as np
import pandas as pd
lengths = pd.Series(data=[2, 4], index=[1, 2]) # Corresponds to the chains' lengths
index = pd.MultiIndex.from_tuples([(i, j) for i in lengths.index
                                   for j in np.arange(1, lengths.loc[i] + 1)])
to_fill = pd.DataFrame(index=index, columns=index, dtype=bool)
to_fill.loc[slice(None), slice(None)] = False
print(to_fill)
# 1 2
# 1 2 1 2 3 4
# 1 1 False False False False False False
# 2 False False False False False False
# 2 1 False False False False False False
# 2 False False False False False False
# 3 False False False False False False
# 4 False False False False False False
index_array = pd.DataFrame([[1, 1, 1, 1],
                            [1, 1, 1, 2],
                            [2, 3, 2, 3],
                            [2, 3, 2, 4]],
                           columns=["i_1", "j_1", "i_2", "j_2"])
print(index_array)
# i_1 j_1 i_2 j_2
# 0 1 1 1 1
# 1 1 1 1 2
# 2 2 3 2 3
# 3 2 3 2 4
Now I want to set all entries in to_fill that correspond to (i_1, j_1), (i_2, j_2) for a row in index_array to True. So basically, index_array refers to entries in to_fill that should be changed. The expected result would thus be:
print(to_fill)
# 1 2
# 1 2 1 2 3 4
# 1 1 True True False False False False
# 2 False False False False False False
# 2 1 False False False False False False
# 2 False False False False False False
# 3 False False False False True True
# 4 False False False False False False
But I did not manage to properly use index_array as an index. How can I tell to_fill to treat the indexing arrays i_1, j_1, i_2, and j_2 as corresponding index values for the levels of the row and column MultiIndex respectively?
This is a little better - hmm perhaps not really:
tuples = [tuple(x) for x in index_array.values]
stacked = to_fill.stack(level=0).stack() # double stack carefully ordered
stacked.loc[tuples] = True
result = stacked.unstack(level=2).unstack().dropna(axis=1) #unstack and drop NaN cols
This is not great, as I would rather not use iterrows() if it can be helped:
idx = pd.IndexSlice
for row in index_array.iterrows():
    r = row[1]
    i_1 = r.loc['i_1']
    j_1 = r.loc['j_1']
    i_2 = r.loc['i_2']
    j_2 = r.loc['j_2']
    to_fill.loc[idx[i_1, j_1], idx[i_2, j_2]] = True
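A vectorized alternative (a sketch, not from the answers above, reusing the to_fill and index_array defined in the question) is to translate the label pairs into positional indices with get_indexer and set them all at once:
# Build MultiIndex objects for the row and column coordinates of each pair.
rows = pd.MultiIndex.from_arrays([index_array['i_1'], index_array['j_1']])
cols = pd.MultiIndex.from_arrays([index_array['i_2'], index_array['j_2']])

# Translate labels into integer positions along each axis.
r = to_fill.index.get_indexer(rows)
c = to_fill.columns.get_indexer(cols)

# Set the pairs pointwise (.loc with two label arrays would select a cross product instead).
arr = to_fill.to_numpy(copy=True)
arr[r, c] = True
to_fill = pd.DataFrame(arr, index=to_fill.index, columns=to_fill.columns)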
